Abstract: This paper examines the ability of greedy algorithms 
to estimate a block sparse parameter vector from noisy measure- 
ments. In particular, block sparse versions of the orthogonal 
matching pursuit and thresholding algorithms are analyzed 
under both adversarial and Gaussian noise models. In the 
adversarial setting, it is shown that estimation accuracy comes 
within a constant factor of the noise power. Under Gaussian noise, 
the Cramer-Rao bound is derived, and it is shown that the greedy 
techniques come close to this bound at high SNR. The guarantees 
are numerically compared with the actual performance of block 
and non-block algorithms, highlighting the advantages of block 
sparse techniques. 

I. Introduction 

The success of signal processing techniques depends to a 
large extent on the availability of an appropriate model which 
captures our knowledge of the system under consideration 
and translates it to a productive mathematical framework. 
There is consequently an ongoing search for mathematical 
models which can accurately describe real-world signals. In 
recent years, much research has been devoted to the sparse 
representation model, which stems from the observation that 
many signals can be approximated using a small number of 
elements, or "atoms," chosen from a large dictionary [1]- 
[3]. Thus, we may write y = Dx + w, where the signal 
y is a linear combination of a small number of columns 
of the dictionary matrix D, corrupted by noise w. Since 
only a small number of elements of D are required for this 
representation, the vector x is sparse, i.e., most of its entries 
equal 0. It turns out that the sparsity assumption can be used 
to accurately estimate x from y, even when the number of 
possible atoms (and thus, the length of x) is greater than 
the number of measurements in y [2], [4], [5]. This model 
has been used to great advantage in many fundamental fields 
of signal processing, including compressed sensing [1], [2], 
denoising [6], deblurring [7], and interpolation [8]. 

The assumption of sparsity is an example of a much more 
general class of signal models which can be described as 
a union of subspaces [9]-[11]. Indeed, each support pattern 
defines a subspace of the space of possible parameter vectors. 
Saying that the parameter contains no more than k nonzero 
entries is equivalent to stating that x belongs to the union 
of all such subspaces. Unions of subspaces are proving to be 
a powerful generalization of the sparsity model. Apart from 
ordinary sparsity, unions of subspaces have been applied to 
estimate signals as diverse as pulse streams [12], [13], multiband 
communications [14]-[16], and block sparse vectors 
[11], [17]-[19], the latter being the focus of this paper. The 
common thread running through these applications is the 
ability to exploit the union of subspaces structure in order 
to achieve accurate reconstruction of signals from a very low 
number of measurements. 

The block sparsity model is based on the realization that in 
many practical sparse representation settings, not all support 
patterns are equally likely. Specifically, if a particular element 
of x is nonzero, then in many cases "similar" elements in 
x are also nonzero. The precise definition of similarity is 
context-dependent. For example, in Fourier-based dictionaries, 
neighboring frequency bins are often jointly nonzero, while in 
wavelet-based dictionaries, nonzero entries in a certain detail 
level are likely to be correlated with nonzeros in higher detail 
levels. Consequently, the sparsity model does not incorporate 
all of the structure present in the signal. The block sparsity 
approach aims to partially overcome this drawback by par- 
titioning the vector x into blocks, each of which contains a 
small number of elements. The structure imposed by the block 
sparsity model is that no more than a small number k of blocks 
are nonzero. The model thus favors the use of related atoms, 
rather than sporadic dictionary columns. Consequently, block 
sparsity is well-suited for those situations described above, in 
which specific atoms tend to be used together. 

The usefulness of a model depends on the existence of 
efficient and effective methods for estimating a signal x from 
its measurements. Fortunately, estimators designed for the 
ordinary sparsity model can be readily adapted to the block 
sparse setting. Thus, previous work has described techniques 
such as block orthogonal matching pursuit (BOMP) [19] and 
the mixed $\ell_2/\ell_1$-optimization (L-OPT) [11], [18], the latter 
being a block version of the Lasso. In this paper, we also 
describe a block-sparse version of the thresholding algorithm, 
which we refer to as block-thresholding (BTH). The BOMP 
and BTH approaches are representatives of a class of so-called 
greedy algorithms, which attempt to identify the support of x 
by choosing at each step the most likely candidate. In this 
paper we restrict attention to these greedy techniques, which 
are simpler (and more naive) than convex relaxation techniques 
such as L-OPT, and are therefore more suitable for implemen- 
tation in large-scale or computationally parsimonious settings. 

Having described various estimation algorithms, it is nat- 
ural to ask what can be guaranteed analytically about the 
performance of these methods in practice. For example, in 
the ordinary (non-block) sparsity setting, a rich collection of 
performance guarantees exists for various algorithms under 
different noise models. In particular, a distinction is made 
between adversarial and random noise models. In the former 
case, nothing is known about w except that it is bounded, 
$\|w\|_2 \le \varepsilon$; in particular, w might be chosen so as to maximally 
harm a given estimation algorithm. Consequently, guarantees 
in this case are relatively weak, ensuring only that the error in 
x is on the order of $\varepsilon$ [2], [4], [5]. By contrast, when the noise 
is random, estimation performance is considerably improved 
for most noise realizations [4], [20], [21]. 

It is natural to seek an extension of these results to the block 
sparsity model. In the absence of noise, successful recovery of 
a block sparse parameter x from measurements y = Dx has 
been demonstrated in the past for both BOMP and L-OPT [11], 
[19]. However, to the best of our knowledge, the only result 
providing analytical guarantees for a block sparse estimator 
under noise was given in [11], where the performance of L- 
OPT was analyzed under adversarial noise. The goal of this 
paper is to analyze the performance of the greedy algorithms 
BOMP and BTH under both adversarial and random noise 
models. As we will see, despite the fact that these greedy 
algorithms are simpler and more efficient to implement, their 
performance is close to the optimal achievable results. 

Specifically, we first analyze the adversarial noise model, 
and show that both BOMP and BTH achieve an error on the 
order of $\varepsilon$ when the noise is bounded by $\|w\|_2 \le \varepsilon$. These 
results generalize previous guarantees in several ways: First, 
when each block contains one element, we recover the non-block 
sparsity guarantee of Donoho et al. [5]. Second, when 
the noise bound $\varepsilon$ equals 0, we obtain the noise-free guarantees 
of Eldar et al. [19]. 

We next turn to the random noise model, and examine in 
particular the case in which w is white Gaussian noise. We 
derive the Cramer-Rao bound (CRB) for estimating x from 
its measurements, and show that this bound equals the error 
of the "oracle estimator" which knows the locations of the 
nonzero blocks of x. However, while the oracle estimator 
relies on information which is unavailable in practice, the CRB 
is known to be achievable by the maximum likelihood (ML) 
technique at high SNR. Unfortunately, computing the ML estimator is 
NP-hard, and thus it probably cannot be implemented efficiently. 
Nevertheless, we proceed to show that both BOMP and 
BTH come within a nearly constant factor of the CRB at high 
SNR, for dictionaries satisfying suitable requirements. Once 
again, when each block contains one element, we can recover 
previously known guarantees for non-block sparsity [21] from 
our results. Furthermore, we show that in typical block sparse 
situations, the performance guarantees of block algorithms are 
substantially better than those of non-block techniques. 

The rest of this paper is organized as follows. The block 
sparse setting is defined in Section II, and the BOMP and 
BTH techniques are described in Section III. The adversarial 
noise model is then analyzed in Section IV. The treatment 
of random noise begins with the derivation of the CRB 
in Section V, while performance guarantees for this case 
appear in Section VI. Finally, the guarantees and the CRB are 
compared with the actual performance of BOMP and BTH in 
a numerical study in Section VII. 



II. Problem Setting 

A. Notation 

The following notation is used throughout the paper. Matri- 
ces and vectors are denoted by boldface uppercase letters M 
and boldface lowercase letters v, respectively. The $\ell_2$ norm of 
a vector v is $\|v\|_2$ and the spectral norm of a matrix M is 
$\|M\|$. The expectation of a random vector v will be denoted 
$E\{v\}$ or, occasionally, $E_x\{v\}$, where the subscript is intended 
to emphasize the fact that the expectation is a function of the 
deterministic quantity x. The adjoint and the Moore-Penrose 
pseudoinverse of a matrix M are denoted, respectively, by 
$M^*$ and $M^\dagger$, while the column space of M is $\mathcal{R}(M)$. We 
denote by $v[i]$ the ith d-element block of a vector v of length 
$N = Md$. Thus 

$$v[i] = [v_{(i-1)d+1}, v_{(i-1)d+2}, \ldots, v_{id}]^T, \quad 1 \le i \le M. \tag{1}$$

Consequently, we may write 

$$v = [v^T[1], \ldots, v^T[M]]^T. \tag{2}$$

Similarly, given a matrix M having N columns, the submatrix 
$M[i]$ contains the columns $(i-1)d+1, (i-1)d+2, \ldots, id$ 
of M, i.e., those columns of M which correspond to the ith 
block. The support supp(v) of v is defined as the set of indices 
of nonzero blocks of v; formally 

$$\mathrm{supp}(v) = \{i : v[i] \ne 0\}. \tag{3}$$

Given an index set I, the vector $v_I$ is constructed as the 
subvector of v containing the blocks indexed by I; in other 
words, if $I = \{i_1, \ldots, i_p\}$, then 

$$v_I = [v^T[i_1], \ldots, v^T[i_p]]^T. \tag{4}$$

Likewise, the submatrix $M_I$ contains the column blocks 
indexed by I, so that 

$$M_I = [M[i_1], \ldots, M[i_p]]. \tag{5}$$

To uniquely define $v_I$ and $M_I$, we will assume as a convention 
that the elements of I are sorted, i.e., $i_1 < i_2 < \cdots < i_p$. 
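The block notation of (1)-(5) can be illustrated with a minimal NumPy sketch. The helper names (`block`, `block_support`, `subvector`, `submatrix`) are hypothetical and chosen only for this illustration; block indices are 1-based to match the text.

```python
import numpy as np

def block(v, i, d):
    """Return v[i], the i-th d-element block of v (1-based index), as in (1)."""
    return v[(i - 1) * d : i * d]

def block_support(v, d):
    """Return supp(v): the sorted 1-based indices of the nonzero blocks, as in (3)."""
    M = v.size // d
    return [i for i in range(1, M + 1) if np.any(block(v, i, d) != 0)]

def subvector(v, I, d):
    """Return v_I: the blocks of v indexed by I, stacked in sorted order, as in (4)."""
    return np.concatenate([block(v, i, d) for i in sorted(I)])

def submatrix(Mat, I, d):
    """Return M_I: the column blocks of Mat indexed by I, in sorted order, as in (5)."""
    return np.hstack([Mat[:, (i - 1) * d : i * d] for i in sorted(I)])

# Example with N = 8, d = 2, M = 4: blocks 2 and 4 are nonzero.
v = np.array([0.0, 0.0, 1.0, 2.0, 0.0, 0.0, 3.0, 0.0])
print(block_support(v, 2))      # [2, 4]
print(subvector(v, [4, 2], 2))  # [1. 2. 3. 0.]
```

Sorting the index set inside `subvector` and `submatrix` mirrors the convention that the elements of I are taken in increasing order.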

B. Problem Definition 

Let $x \in \mathbb{C}^N$ be a deterministic block-sparse vector, i.e., 
x consists of M blocks $x[1], \ldots, x[M]$ of size d, of which 
at most k are nonzero [19]. The maximum support size k is 
assumed to be known. The block sparsity restriction can then 
be written as 

$$x \in \mathcal{X} = \{v \in \mathbb{C}^N : |\mathrm{supp}(v)| \le k\}. \tag{6}$$

For convenience, let $S = \mathrm{supp}(x)$ be the support of the 
parameter x, and let $s = |S|$. Note the distinction between 
k and s: It is known that at most k blocks are nonzero, but 
the actual number of nonzero blocks s is unknown and may 
be smaller than k. In the sequel, it will be useful to define 

$$|x_{\max}| = \max_{i \in S} \|x[i]\|_2, \qquad |x_{\min}| = \min_{i \in S} \|x[i]\|_2. \tag{7}$$

The block sparse model differs from the more common non-block 
sparsity setting: in the latter, it is assumed that a small 
number of entries (rather than blocks) in the vector x are 
nonzero. To emphasize this difference, we will occasionally 
refer to the non-block sparsity model as "ordinary" or "scalar" 
sparsity. 

We are given noisy observations 

y = Dx + w (8) 

where $D \in \mathbb{C}^{L \times N}$ is a known, deterministic dictionary, and 
w is a noise vector. Our goal is to estimate x from the 
measurements y. It will be convenient to denote the ith column 
(or "atom") of D as $d_i$. Thus we have 

$$D = [\underbrace{d_1, \ldots, d_d}_{D[1]}, \underbrace{d_{d+1}, \ldots, d_{2d}}_{D[2]}, \ldots, \underbrace{d_{N-d+1}, \ldots, d_N}_{D[M]}]. \tag{9}$$

We assume for simplicity that the dictionary atoms are normalized, 
$\|d_i\|_2 = 1$. We also assume that the measurement 
system is underdetermined, i.e., the number of measurements 
L is less than the number of parameters N; thus, we must 
utilize the structure $\mathcal{X}$, for otherwise we have no hope of 
recovering x from its measurements. Finally, we require that 
for any index set I of size $|I| \le k$, the subdictionary $D_I$ has 
full column rank. This latter assumption is needed to ensure 
that after a support set is chosen, one may estimate x using 
standard techniques for inverting an overdetermined set of linear 
equations, e.g., the least-squares approach. 

We will provide performance guarantees for two separate 
noise models. First, we consider the adversarial setting, in 
which the noise is unknown but bounded, 

$$\|w\|_2 \le \varepsilon \tag{10}$$

for a known constant $\varepsilon > 0$. In this case the goal is to 
provide performance guarantees which hold for all values of 
w satisfying (10). Second, we treat additive white Gaussian 
noise, in which 

$$w \sim N(0, \sigma^2 I). \tag{11}$$

In this case w is unbounded, and the goal will be to provide 
guarantees which hold with high probability. 
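As a rough illustration of the measurement model (8) and the two noise models (10) and (11), the following sketch draws a synthetic instance. All sizes, the random dictionary, and the real-valued setting are assumptions made for the example only; in particular, a randomly scaled noise vector merely satisfies the bound (10) and is not a worst-case adversarial choice.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M, d, k = 40, 20, 4, 2        # hypothetical sizes; N = M*d > L (underdetermined)
N = M * d

# Dictionary with normalized atoms, ||d_i||_2 = 1 (random, for illustration only).
D = rng.standard_normal((L, N))
D /= np.linalg.norm(D, axis=0)

# Block-sparse parameter: k nonzero blocks at random positions (0-based block indices).
x = np.zeros(N)
S = np.sort(rng.choice(M, size=k, replace=False))
for i in S:
    x[i * d:(i + 1) * d] = rng.standard_normal(d)

# Bounded noise as in (10): any w with ||w||_2 <= eps is admissible.
eps = 0.1
w_bounded = rng.standard_normal(L)
w_bounded *= eps / np.linalg.norm(w_bounded)

# White Gaussian noise as in (11).
sigma = 0.05
w_gauss = sigma * rng.standard_normal(L)

y_bounded = D @ x + w_bounded
y_gauss = D @ x + w_gauss
```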

Following [19], we define the block coherence of D as 

$$\mu_B = \max_{i \ne j} \frac{1}{d} \|D^*[i] D[j]\|. \tag{12}$$

We also define the sub-coherence 

$$\nu = \max_{1 \le \ell \le M} \; \max_{(\ell-1)d+1 \le i \ne j \le \ell d} |d_i^* d_j|. \tag{13}$$

The block coherence and sub-coherence are generalizations of 
the concept of the coherence, which is defined as 

$$\mu = \max_{i \ne j} |d_i^* d_j| \tag{14}$$

and applies to dictionaries regardless of whether they have a 
block structure. 
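The three quantities (12)-(14) can be computed directly from the Gram matrix of the dictionary. The sketch below assumes unit-norm atoms and a number of columns divisible by d; the function name `coherences` is hypothetical.

```python
import numpy as np

def coherences(D, d):
    """Return (mu_B, nu, mu) of (12)-(14) for a dictionary D with block size d."""
    M = D.shape[1] // d
    G = D.conj().T @ D                          # Gram matrix with entries d_i^* d_j
    off_diag = np.abs(G - np.diag(np.diag(G)))  # |d_i^* d_j| for i != j, zero diagonal
    mu = float(off_diag.max())                  # coherence (14)
    mu_B, nu = 0.0, 0.0
    for i in range(M):
        for j in range(M):
            rows = slice(i * d, (i + 1) * d)
            cols = slice(j * d, (j + 1) * d)
            if i == j:
                # Sub-coherence (13): largest inner product between atoms of one block.
                nu = max(nu, float(off_diag[rows, cols].max()))
            else:
                # Block coherence (12): spectral norm of D*[i] D[j], divided by d.
                mu_B = max(mu_B, float(np.linalg.norm(G[rows, cols], 2)) / d)
    return mu_B, nu, mu
```

For d = 1 the diagonal blocks are 1 x 1, so the sub-coherence is zero and the block coherence reduces to the ordinary coherence.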

III. Techniques for Block-Sparse Estimation 

For reference and in order to fix notation, we now describe 
the two greedy algorithms for which we provide performance 
guarantees. 



a) Block-Thresholding (BTH): We propose the following 
straightforward extension of the well-known thresholding algorithm. 
Given a measurement vector $y \in \mathbb{C}^L$, perform the 
following steps: 

1) Compute the correlations 

$$\rho_i = \|D^*[i] y\|_2, \quad i = 1, \ldots, M. \tag{15}$$

2) Find the k largest correlations and denote their indices 
by $i_1, \ldots, i_k$. In other words, find a set of indices $\hat{S} = \{i_1, \ldots, i_k\}$ 
such that $\rho_i \ge \rho_j$ for all $i \in \hat{S}$ and $j \notin \hat{S}$. 

3) The reconstructed signal is given by 

$$\hat{x}_{BTH} = \arg\min_{x : \mathrm{supp}(x) = \hat{S}} \|y - Dx\|_2. \tag{16}$$
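The three BTH steps can be sketched in NumPy as follows. This is an illustrative implementation under the assumptions of Section II, not a reference one; the function name `bth` is hypothetical, and block indices are 0-based in the code.

```python
import numpy as np

def bth(y, D, d, k):
    """Block-thresholding sketch: returns (x_hat, S_hat) with 0-based block indices."""
    N = D.shape[1]
    M = N // d
    # Step 1: correlations rho_i = ||D*[i] y||_2, one per block.
    rho = np.linalg.norm((D.conj().T @ y).reshape(M, d), axis=1)
    # Step 2: indices of the k largest correlations, sorted.
    S_hat = np.sort(np.argsort(rho)[-k:])
    # Step 3: least squares restricted to the columns of the selected blocks.
    cols = np.concatenate([np.arange(i * d, (i + 1) * d) for i in S_hat])
    coef, *_ = np.linalg.lstsq(D[:, cols], y, rcond=None)
    x_hat = np.zeros(N, dtype=coef.dtype)
    x_hat[cols] = coef
    return x_hat, S_hat
```

Because the support is chosen in a single pass, the cost is dominated by one matrix-vector product and one small least-squares problem.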

b) Block Orthogonal Matching Pursuit (BOMP): The 
BOMP algorithm, based on the OMP algorithm [22], was first 
proposed in [19]. 

Given a measurement vector $y \in \mathbb{C}^L$, perform the following 
steps: 

1) Define $r^0 = y$. 

2) For each $\ell = 1, \ldots, k$, do the following: 

a) Set 

$$i_\ell = \arg\max_i \|D^*[i] r^{\ell-1}\|_2. \tag{17}$$

b) Set 

$$x^\ell = \arg\min_{x : \mathrm{supp}(x) \subseteq \{i_1, \ldots, i_\ell\}} \|y - Dx\|_2. \tag{18}$$

c) Set $r^\ell = y - D x^\ell$. 

3) The estimate is given by $\hat{x}_{BOMP} = x^k$. 
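The BOMP iteration can be sketched along the same lines. The function name `bomp` is hypothetical, block indices are 0-based, and the guard against re-selecting an already chosen block is an implementation convenience assumed here (it only matters when the residual has already vanished).

```python
import numpy as np

def bomp(y, D, d, k):
    """Block-OMP sketch: returns (x_hat, chosen) with 0-based block indices."""
    N = D.shape[1]
    M = N // d
    r = y.copy()                               # step 1: r^0 = y
    chosen = []
    x_hat = np.zeros(N, dtype=np.result_type(D, y))
    for _ in range(k):                         # step 2: k greedy iterations
        # (a) block most correlated with the current residual, as in (17)
        rho = np.linalg.norm((D.conj().T @ r).reshape(M, d), axis=1)
        i = int(np.argmax(rho))
        if i not in chosen:
            chosen.append(i)
        # (b) least squares over all blocks selected so far, as in (18)
        cols = np.concatenate([np.arange(j * d, (j + 1) * d) for j in sorted(chosen)])
        coef, *_ = np.linalg.lstsq(D[:, cols], y, rcond=None)
        x_hat = np.zeros(N, dtype=coef.dtype)
        x_hat[cols] = coef
        # (c) update the residual
        r = y - D @ x_hat
    return x_hat, sorted(chosen)               # step 3: the estimate is x^k
```

Unlike block-thresholding, each iteration recomputes the correlations against the residual, so a strong block that has already been selected no longer masks weaker ones.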

c) Oracle Estimator: We will find it useful to analyze the 
oracle estimator, which is defined as the least-squares solution 
within the true support set, i.e., 

$$\hat{x}_{or} = \arg\min_{x : \mathrm{supp}(x) \subseteq S} \|y - Dx\|_2. \tag{19}$$

Using the notation introduced above, we have 

$$(\hat{x}_{or})_S = (D_S^* D_S)^{-1} D_S^* y, \qquad (\hat{x}_{or})_{S^c} = 0 \tag{20}$$

where $S^c = \{1, \ldots, M\} \setminus S$ is the complement of the support 
set S. Note that the term "oracle estimator" is somewhat 
misleading, since $\hat{x}_{or}$ relies on knowledge of the true support 
set S, and is therefore not a true estimator. 
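For comparison purposes, the oracle estimator amounts to a single least-squares solve on the columns of the true support. The following sketch assumes the subdictionary on S has full column rank, as required in Section II; the function name `oracle` is hypothetical and block indices are 0-based.

```python
import numpy as np

def oracle(y, D, d, S):
    """Least-squares estimate restricted to the true block support S (0-based)."""
    N = D.shape[1]
    cols = np.concatenate([np.arange(i * d, (i + 1) * d) for i in sorted(S)])
    D_S = D[:, cols]
    # Solve the normal equations (D_S^* D_S) c = D_S^* y; zero outside the support.
    coef = np.linalg.solve(D_S.conj().T @ D_S, D_S.conj().T @ y)
    x_or = np.zeros(N, dtype=coef.dtype)
    x_or[cols] = coef
    return x_or
```

In a simulation, where the true support is available, this provides a convenient benchmark for BTH and BOMP.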

IV. Guarantees for Adversarial Noise 

We begin by stating our performance guarantees in the case 
of adversarial noise. The proofs of these results are quite 
technical and can be found in Appendix A. 

Theorem 1. Consider the setting of Section II with adversarial 
noise (10). Suppose that 

$$(1 - (d-1)\nu)|x_{\min}| > 2\varepsilon\sqrt{1 + (d-1)\nu} + (2k-1)d\mu_B|x_{\max}|. \tag{21}$$

Then, the BTH algorithm correctly identifies all elements of 
the support of x, and its error is bounded by 

$$\|\hat{x}_{BTH} - x\|_2^2 \le \frac{\varepsilon^2}{1 - (d-1)\nu - (k-1)d\mu_B}. \tag{22}$$

Theorem 2. Consider the setting of Section II with adversarial 
noise (10). Suppose that 

$$(1 - (d-1)\nu)|x_{\min}| > 2\varepsilon\sqrt{1 + (d-1)\nu} + (2k-1)d\mu_B|x_{\min}|. \tag{23}$$

Then, the BOMP algorithm identifies all elements of supp(x), 
and its error is bounded by 

$$\|\hat{x}_{BOMP} - x\|_2^2 \le \frac{\varepsilon^2}{1 - (d-1)\nu - (k-1)d\mu_B}. \tag{24}$$
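To get a feel for the conditions of Theorems 1 and 2, they can be evaluated numerically for given parameters. The sketch below assumes the conditions read as (1 - (d-1)ν)|x_min| > 2ε√(1 + (d-1)ν) + (2k-1)dμ_B·c, with c = |x_max| for BTH and c = |x_min| for BOMP, and that the error bound is ε²/(1 - (d-1)ν - (k-1)dμ_B); the function name and its return convention are hypothetical.

```python
import math

def adversarial_guarantee(x_min, x_max, eps, d, k, nu, mu_B, algorithm="BOMP"):
    """Return (holds, bound): whether the sufficient condition is met and, if so,
    the guaranteed bound on the squared estimation error (None otherwise)."""
    interference = x_max if algorithm == "BTH" else x_min
    lhs = (1 - (d - 1) * nu) * x_min
    rhs = 2 * eps * math.sqrt(1 + (d - 1) * nu) + (2 * k - 1) * d * mu_B * interference
    holds = lhs > rhs
    # When the condition holds, the denominator below is strictly positive.
    bound = eps**2 / (1 - (d - 1) * nu - (k - 1) * d * mu_B) if holds else None
    return holds, bound

# Hypothetical values: blocks of size 2, two active blocks, mild coherence.
print(adversarial_guarantee(1.0, 3.0, 0.05, 2, 2, 0.05, 0.02, "BOMP"))
print(adversarial_guarantee(1.0, 3.0, 0.05, 2, 2, 0.05, 0.02, "BTH"))
```

With these values the ratio |x_max|/|x_min| = 3 enters only the BTH condition, which is how a wide dynamic range of block magnitudes makes the BTH requirement harder to satisfy.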

The following remarks should be made concerning Theo- 
rems 1 and 2. 

• Scalar sparsity: The scalar sparsity setting, in which x 
has no more than k nonzero elements, can be recovered by 
choosing d = 1. In this case, BOMP and BTH reduce to 
their scalar versions, which are called OMP and thresholding, 
respectively, and the block coherence $\mu_B$ equals the coherence 
$\mu$ of (14). Theorems 1 and 2 then coincide with the well-known 
results of Donoho et al. [5] for performance of scalar sparse 
signals under adversarial noise. As an example (and for future 
reference), the OMP performance guarantee is given below. 

Corollary 1 (Donoho et al. [5]). Let y = Dx + w be a 
measurement vector of a signal x having sparsity $\|x\|_0 \le k$. 
Suppose that the coherence $\mu$ of the dictionary D satisfies 

$$|x_{\min}|(1 - (2k-1)\mu) > 2\varepsilon. \tag{25}$$

Then, OMP recovers the correct support pattern of x and 
achieves an error bounded by 

$$\|\hat{x}_{OMP} - x\|_2^2 \le \frac{\varepsilon^2}{1 - (k-1)\mu}. \tag{26}$$

Note that in the case of ordinary sparsity, d = 1, and 
therefore $|x_{\min}|$ can be defined simply as the magnitude of 
the smallest nonzero element in x. 

• Benefits and limitations of block sparsity: It is interesting 
to compare the achievable performance guarantees when one 
utilizes the block-sparse structure, as opposed to merely using 
ordinary (scalar) sparsity information. For concreteness, we 
focus in this discussion on a comparison between OMP and 
BOMP, but identical conclusions can be drawn by comparing 
the thresholding algorithm with its block-sparse version BTH. 

Consider a block sparse signal x as defined in Section II. 
Such a signal can also be viewed as a scalar sparse signal of 
length N = Md, having no more than sd nonzero elements. 
It is readily shown that the coherence $\mu$ satisfies $\nu \le \mu$ and 
$\mu_B \le \mu$ [19]. Consequently, 

$$\frac{\varepsilon^2}{1 - (d-1)\nu - (k-1)d\mu_B} \le \frac{\varepsilon^2}{1 - (sd-1)\mu}, \tag{27}$$

which implies that if the conditions for the performance guarantees 
of both BOMP and OMP hold, then the performance 
guarantee (24) for BOMP will be at least as good as that of 
OMP (26). Moreover, in typical block-sparse settings, both 
$\nu$ and $\mu_B$ will be substantially smaller than $\mu$ [19], and the 
guarantees for BOMP will then be considerably better. 

These results notwithstanding, it should be noted that 
BOMP should not automatically be preferred over OMP in 
any setting. This is because the condition (23) of Theorem 2 
can sometimes be more restrictive than that of OMP. Specifically, the 
factor $2\varepsilon\sqrt{1 + (d-1)\nu}$ in (23) is larger than the analogous 
term $2\varepsilon$ in (25).¹ This implies that if the sub-coherence $\nu$ is 
large, block sparse algorithms will not perform as well as their 
scalar counterparts. Such a result is to be expected: Highly 
correlated dictionary blocks may cause noise amplification, 
and in such cases, it may be preferable to separately correlate 
each atom with the measurements, rather than relying on the 
combined correlation of the entire block. Indeed, it would 
be quite surprising if a partition of any dictionary D into 
arbitrary blocks could be shown to perform as well as a scalar 
sparsity algorithm, since the former adds a restriction on the 
possible support patterns of the vector x. The lesson to be 
learned from this analysis is that block sparsity techniques 
are effective when the dictionary can be separated into blocks 
whose elements are orthogonal or nearly orthogonal. 

• Noiseless case: The situation in which y = Dx, i.e., no 
noise is present in the system, has been previously analyzed 
in the context of block sparsity in [19]. This setting can be 
recovered by choosing the noise bound $\varepsilon = 0$. In this case, 
the condition (23) simplifies to 

$$(d-1)\nu + (2k-1)d\mu_B < 1 \tag{28}$$

and Theorem 2 then amounts to a guarantee for perfect 
recovery of x if (28) holds. This result for the noise-free 
setting has been previously demonstrated in [19, Thm. 3]. 

Similarly, by substituting $\varepsilon = 0$ into Theorem 1, one obtains 
a perfect recovery condition for BTH in the noiseless setting. 
Specifically, if the condition 

$$(d-1)\nu + (2k-1)d\mu_B \frac{|x_{\max}|}{|x_{\min}|} < 1 \tag{29}$$

is satisfied, then BTH correctly recovers x from its noiseless 
measurements y = Dx. 

Since BTH is a much simpler algorithm than BOMP, it is 
not surprising that the sufficient condition (29) for BTH is 
somewhat stronger than the corresponding condition (28) for 
BOMP. This difference between the conditions is indicative of 
the different strategies employed by the two techniques, and 
will be further discussed in Section VI. 
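The reduction to the noiseless conditions can be written out explicitly. The following short derivation is a sketch that assumes the conditions of Theorems 1 and 2 have the form used in this section, with $|x_{\min}| > 0$:

```latex
\begin{align*}
\text{BOMP, } \varepsilon = 0:\quad
  & (1-(d-1)\nu)\,|x_{\min}| > (2k-1)\,d\mu_B\,|x_{\min}| \\
  \Longleftrightarrow\quad
  & (d-1)\nu + (2k-1)\,d\mu_B < 1, \\[4pt]
\text{BTH, } \varepsilon = 0:\quad
  & (1-(d-1)\nu)\,|x_{\min}| > (2k-1)\,d\mu_B\,|x_{\max}| \\
  \Longleftrightarrow\quad
  & (d-1)\nu + (2k-1)\,d\mu_B\,\frac{|x_{\max}|}{|x_{\min}|} < 1.
\end{align*}
```

Both equivalences follow by dividing through by $|x_{\min}|$. Since $|x_{\max}|/|x_{\min}| \ge 1$, the BTH condition implies the BOMP condition, and the two coincide when all nonzero blocks have equal norm.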

• Severity of the error: As in the scalar sparsity scenario, the 
presence of adversarial noise severely limits the ability of any 
algorithm to perform denoising. This is evident from Theorems 
1 and 2, which guarantee only that the distance between the 
estimates and the true value of x is on the order of the noise 
magnitude $\varepsilon$. Given our detailed knowledge of the structure 
of the signal x, one would expect more powerful denoising 
capabilities for typical noise realizations. Consequently, in the 
remainder of this paper, we adopt the assumption of random 
noise, which cannot align itself so as to maximally interfere 
with the recovery algorithms. 

¹ The remaining terms in (23) are always no worse than the corresponding 
terms in (25). 

V. The Cramer-Rao Bound 

A central goal in assessing the quality of an estimator is 
to check its proximity to the best possible performance in the 
given setting. To this end, it is common practice to compute 
the CRB for unbiased estimators [23], i.e., those techniques $\hat{x}$ 
for which the bias $b(x) = E_x\{\hat{x}\} - x$ equals zero. The CRB 
is a lower bound on the mean-squared error $\mathrm{MSE}(\hat{x}, x) = E_x\{\|\hat{x} - x\|_2^2\}$ 
for any unbiased estimator $\hat{x}$. 

To utilize the information inherent in the block sparsity 
structure, we apply the constrained CRB [24]-[27] to the 
present setting. In the constrained estimation scenario, one 
often seeks estimators which are unbiased for all parameter 
values in the constraint set [24], [25]. However, as we will 
see below, this requirement is too strict in the block sparse 
setting. Indeed, in Theorem 3 we show that it is not possible 
to construct any method which is unbiased for all feasible 
parameter values. Consequently, a weaker, local definition 
of unbiasedness is called for, which we refer to as $\mathcal{X}$-unbiasedness [27]. 

Intuitively, an estimator $\hat{x}$ is said to be $\mathcal{X}$-unbiased at a 
point $x \in \mathcal{X}$ if $E_x\{\hat{x}\} = x$ holds at the point x and at all 
points of $\mathcal{X}$ which are sufficiently close to x. To formally 
define $\mathcal{X}$-unbiasedness, we first recall the concept of a feasible 
direction. A vector $v \in \mathbb{C}^N$ is said to be a feasible direction 
at x if, for any sufficiently small $\alpha$, we have $x + \alpha v \in \mathcal{X}$. We 
then say that $\hat{x}$ is $\mathcal{X}$-unbiased at x if $E_x\{\hat{x}\} = x$ and if 

$$\left. \frac{\partial b(x + \alpha v)}{\partial \alpha} \right|_{\alpha = 0} = 0 \tag{30}$$

for any feasible direction v. In other words, the bias is zero at 
x and remains unchanged, up to a first-order approximation, 
when moving away from x along feasible directions. This 
definition yields the following result, whose proof can be 
found in Appendix B. 

Theorem 3 (Cramer-Rao bound for block-sparse signals). 
Consider the setting of Section II in which the block sparse 
parameter vector x is to be estimated from measurements 
corrupted by Gaussian noise (11). 

(a) Suppose x contains fewer than k nonzero blocks, i.e., s < 
k. Then, no finite-variance estimator is $\mathcal{X}$-unbiased at x. 

(b) Suppose x contains precisely k nonzero blocks, i.e., s = k. 
Then, any estimator $\hat{x}$ which is $\mathcal{X}$-unbiased at x satisfies 

$$\mathrm{MSE}(\hat{x}, x) \ge \sigma^2 \,\mathrm{Tr}\left((D_S^* D_S)^{-1}\right). \tag{31}$$



We recall that both the MSE and the CRB are functions 
of the unknown vector x, as is generally the case when 
estimating a deterministic parameter. It follows immediately 
from Theorem 3 that no finite-variance estimator can satisfy 
$E_x\{\hat{x}\} = x$ for all $x \in \mathcal{X}$, which explains why we 
previously avoided this simpler definition of unbiasedness in 
the constrained setting. Instead, restricting attention to a local 
unbiasedness requirement led to a finite CRB for almost all 
parameter values in $\mathcal{X}$: specifically, those parameters whose 
support is maximal, $|\mathrm{supp}(x)| = s = k$. 

For maximal-support values of x, it is not difficult to show 
that the CRB (31) coincides with the MSE of the oracle 
estimator (20). In this case it is possible to get a sense for 
the value of the bound, as follows. From (44) of Lemma 1 
(see Appendix A), we have that none of the eigenvalues of 



$(D_S^* D_S)^{-1}$ are larger than $1/(1 - (d-1)\nu - (k-1)d\mu_B)$. 
Thus 

$$\sigma^2 \,\mathrm{Tr}\left((D_S^* D_S)^{-1}\right) \le \frac{kd\sigma^2}{1 - (d-1)\nu - (k-1)d\mu_B}. \tag{32}$$

In other words, when the block coherence and sub-coherence 
of D are low, the bound of Theorem 3 will be close to 
$kd\sigma^2$. This value is typically much lower than the total noise 
variance $E\{\|w\|_2^2\} = L\sigma^2$. Thus, at least according to the 
CRB, it is possible to achieve substantial denoising in the 
presence of random noise. This stands in contrast to the rather 
disappointing guarantees presented for adversarial noise in the 
previous section. We may thus hope that the performance will 
be improved when considering random noise. 
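The coincidence of the CRB with the oracle MSE, and the trace bound that follows, can be verified in a few lines. The derivation below is a sketch for the maximal-support case s = k, using only the model y = Dx + w with white noise of covariance $\sigma^2 I$ and the oracle estimator restricted to the true support S:

```latex
\begin{align*}
(\hat{x}_{or})_S - x_S
  &= (D_S^* D_S)^{-1} D_S^* (D_S x_S + w) - x_S
   = (D_S^* D_S)^{-1} D_S^* w, \\
E\{\|\hat{x}_{or} - x\|_2^2\}
  &= \mathrm{Tr}\!\left( (D_S^* D_S)^{-1} D_S^* \, E\{w w^*\} \, D_S (D_S^* D_S)^{-1} \right)
   = \sigma^2 \,\mathrm{Tr}\!\left( (D_S^* D_S)^{-1} \right), \\
\sigma^2 \,\mathrm{Tr}\!\left( (D_S^* D_S)^{-1} \right)
  &\le \sigma^2 \, kd \, \lambda_{\max}\!\left( (D_S^* D_S)^{-1} \right)
   \le \frac{kd\,\sigma^2}{1 - (d-1)\nu - (k-1)d\mu_B}.
\end{align*}
```

The last line uses the fact that $D_S^* D_S$ is a $kd \times kd$ matrix, so its inverse has $kd$ eigenvalues, each bounded by the eigenvalue estimate quoted from Lemma 1.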

As opposed to the oracle estimator, which cannot be im- 
plemented in practice, it is well-known that the CRB can be 
asymptotically achieved at high SNR by the maximum like- 
lihood (ML) estimator [23]. However, in the present setting, 
computing the ML estimator is NP-hard, and thus impractical. 
Consequently, it is of interest to determine whether there 
exist efficient techniques which come close to the performance 
bound (31), at least for high SNR values. As we will show in 
the next section, this question is answered in the affirmative: 
greedy block sparsity techniques do indeed approach the CRB 
for sufficiently high SNR. 
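A small Monte Carlo experiment illustrates the relation between the CRB and the oracle estimator. The sketch below uses a random real-valued dictionary and hypothetical sizes chosen only for the example; it computes $\sigma^2 \mathrm{Tr}((D_S^* D_S)^{-1})$ and compares it with the empirical MSE of the oracle estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
L, M, d, k, sigma = 60, 30, 3, 2, 0.1     # hypothetical problem sizes
N = M * d

D = rng.standard_normal((L, N))
D /= np.linalg.norm(D, axis=0)             # unit-norm atoms

S = [4, 17]                                # true support (0-based block indices), s = k
cols = np.concatenate([np.arange(i * d, (i + 1) * d) for i in S])
D_S = D[:, cols]
x_S = rng.standard_normal(k * d)           # nonzero part of the parameter vector

# CRB for maximal support: sigma^2 Tr((D_S^* D_S)^{-1}).
gram = D_S.T @ D_S
crb = sigma**2 * np.trace(np.linalg.inv(gram))

# Empirical MSE of the oracle estimator over many noise realizations.
trials = 20000
W = sigma * rng.standard_normal((L, trials))
Y = (D_S @ x_S)[:, None] + W
X_hat = np.linalg.solve(gram, D_S.T @ Y)   # oracle estimate, one column per trial
mse = np.mean(np.sum((X_hat - x_S[:, None]) ** 2, axis=0))

print(f"CRB = {crb:.5f}, empirical oracle MSE = {mse:.5f}")
print(f"k*d*sigma^2 = {k * d * sigma**2:.5f}, total noise variance = {L * sigma**2:.5f}")
```

The CRB lies slightly above $kd\sigma^2$ and well below the total noise variance $L\sigma^2$, in line with the discussion of (32).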

VI. Guarantees for Gaussian Noise 

In this section, we analyze the performance of block sparse 
algorithms when the noise w is a Gaussian random variable 
having mean zero and covariance $\sigma^2 I$. Our main performance 
guarantees are summarized in Theorems 4 and 5. The proofs 
of these theorems are found in Appendix C. 

Theorem 4. Consider the setting of Section II with additive 
white Gaussian noise $w \sim N(0, \sigma^2 I)$. Suppose it is known 
that 

$$(1 - (d-1)\nu)|x_{\min}| - (2k-1)d\mu_B|x_{\max}| > 2\sigma\sqrt{2\alpha d(1 + (d-1)\nu)\log N} \tag{33}$$

for some constant $\alpha > 1/(2d\log N)$. Then, with probability 
exceeding 

$$1 - \frac{0.8\,d\,(2\alpha d\log N)^{d/2-1}}{N^{\alpha d-1}} \tag{34}$$

the BTH algorithm identifies the correct support of x and 
achieves an error bounded by 

$$\|\hat{x}_{BTH} - x\|_2^2 \le \frac{2\alpha(1 + (d-1)\nu)}{(1 - (d-1)\nu - (k-1)d\mu_B)^2}\, dk\sigma^2\log N. \tag{35}$$



Theorem 5. Consider the setting of Section II with additive 
white Gaussian noise $w \sim N(0, \sigma^2 I)$. Suppose it is known 
that 

$$(1 - (d-1)\nu)|x_{\min}| - (2k-1)d\mu_B|x_{\min}| > 2\sigma\sqrt{2\alpha d(1 + (d-1)\nu)\log N} \tag{36}$$

for some constant $\alpha > 1/(2d\log N)$. Then, with probability 
exceeding (34), the BOMP algorithm identifies the correct 
support of x and achieves an error bounded by 

$$\|\hat{x}_{BOMP} - x\|_2^2 \le \frac{2\alpha(1 + (d-1)\nu)}{(1 - (d-1)\nu - (k-1)d\mu_B)^2}\, dk\sigma^2\log N. \tag{37}$$
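The quantities appearing in Theorems 4 and 5 are straightforward to evaluate for given parameters. The sketch below assumes the following reading of the theorems: the condition is (1 - (d-1)ν)|x_min| - (2k-1)dμ_B·c > 2σ√(2αd(1 + (d-1)ν) log N), with c = |x_max| for BTH and c = |x_min| for BOMP; the probability is 1 - 0.8·d·(2αd log N)^(d/2-1) / N^(αd-1); and the error bound is 2α(1 + (d-1)ν)/(1 - (d-1)ν - (k-1)dμ_B)² · dkσ² log N. The function name is hypothetical, and the factor written as 0.8·d is an assumption about the form of the probability expression.

```python
import math

def gaussian_guarantee(x_min, x_max, sigma, alpha, N, d, k, nu, mu_B, algorithm="BOMP"):
    """Return (holds, prob, bound) for the Gaussian-noise guarantees as read above."""
    log_n = math.log(N)
    if alpha <= 1 / (2 * d * log_n):
        raise ValueError("alpha must exceed 1/(2 d log N)")
    interference = x_max if algorithm == "BTH" else x_min
    lhs = (1 - (d - 1) * nu) * x_min - (2 * k - 1) * d * mu_B * interference
    rhs = 2 * sigma * math.sqrt(2 * alpha * d * (1 + (d - 1) * nu) * log_n)
    holds = lhs > rhs
    # Probability lower bound (assumed form; 0.8 * d is an assumption of this sketch).
    prob = 1 - 0.8 * d * (2 * alpha * d * log_n) ** (d / 2 - 1) / N ** (alpha * d - 1)
    # Bound on the squared error, valid when the condition holds.
    bound = (2 * alpha * (1 + (d - 1) * nu)
             / (1 - (d - 1) * nu - (k - 1) * d * mu_B) ** 2
             * d * k * sigma**2 * log_n)
    return holds, prob, bound

# Hypothetical values for illustration.
print(gaussian_guarantee(1.0, 2.0, 0.01, 2.0, 1000, 2, 3, 0.05, 0.01, "BOMP"))
```

For d = 1 the probability expression reduces to the scalar form with constant 0.8/√2, independently of how the factor d is read.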



We now provide some insights into the performance of 
block-sparse algorithms under random noise. 

• Random noise vs. adversarial noise: As noted in Sec- 
tion IV, performance guarantees in the case of adversarial 
noise can ensure a recovery error on the order of the total noise 
magnitude. This is a result of the fact that the noise could, in 
principle, be concentrated in a single nonzero component of 
x, whereupon it would be indistinguishable from the signal. 
However, for random noise, such an event is highly unlikely. 
Consequently, Theorems 4 and 5 provide much tighter perfor- 
mance guarantees: both theorems demonstrate that, with high 
probability, the estimation error is on the order of $dk\sigma^2\log N$, 
i.e., within a constant times $\log N$ of the CRB presented in 
Section V. Since the noise variance $E\{\|w\|_2^2\}$ is given by 
$N\sigma^2$, and since typically $dk\log N \ll N$, we conclude that 
the block sparse algorithms have successfully removed a large 
portion of the noise, owing to the utilization of the union-of- 
subspaces structure. 

• BOMP vs. BTH: Comparing Theorems 4 and 5 leads 
to an important insight concerning the advantage of the more 
sophisticated BOMP algorithm over its simpler counterpart. 
Indeed, the guarantee for BOMP requires condition (36), 
which basically states that $|x_{\min}|$ must be larger than a 
constant multiplied by the standard deviation of the noise. 
By contrast, for the BTH guarantee one requires the stronger 
condition (33), which can be interpreted as requiring $|x_{\min}|$ 
to be larger than a small constant times $|x_{\max}|$, plus another 
constant times the noise standard deviation. 

To explain this difference, recall from Section III that 
the BTH approach relies on a single support-identification 
stage in which the blocks most highly correlated with the 
measurements are chosen as the estimated support set $\hat{S}$. 
Thus, for BTH to correctly identify the support, each block 
in S must be sufficiently large in magnitude to overcome 
interference from the noise and from the remaining blocks. 
Condition (33) can therefore be interpreted as a requirement 
that the magnitude $|x_{\min}|$ of the smallest nonzero block must 
be larger than the sum of the interference from the large 
nonzero blocks (the $|x_{\max}|$ term) and the noise. By contrast, 
the BOMP algorithm iteratively identifies support elements, 
maintaining a residual vector $r^\ell$ containing the components of 
the measurement vector which have yet to be identified. Thus, 
BOMP requires only the ability to separately isolate each 
nonzero block, and hence its weaker condition (36), which 
necessitates only that $|x_{\min}|$ be larger than the noise. 

Finally, it should be noted that when BTH and BOMP 
both identify the correct support set, the estimates of the 
two algorithms coincide, explaining the identical bounds on 
their performance. The conclusion from this analysis is that 
BOMP should be preferred if a wide dynamic range of block 
magnitudes is possible, but that when all blocks have roughly 
the same size, the simpler and more efficient BTH technique 
can be used. 

• Scalar sparsity: It is interesting to note that known results 
for scalar sparsity algorithms can be recovered from our block 
sparsity guarantees, by substituting d = 1 into Theorems 4 and 
5. For example, consider the BOMP guarantee (Theorem 5). 
In the scalar case, this algorithm is known as OMP, and its 
performance guarantee can be written as follows. 

Corollary 2. Let y = Dx + w be a measurement vector of 
a signal x having sparsity $\|x\|_0 \le k$. Suppose the coherence 
$\mu$ of D satisfies 

$$|x_{\min}|(1 - (2k-1)\mu) > 2\sigma\sqrt{2\alpha\log N} \tag{38}$$

for some $\alpha > 1$. Then, with probability exceeding 

$$1 - \frac{0.8/\sqrt{2}}{N^{\alpha-1}\sqrt{\alpha\log N}} \tag{39}$$

the OMP algorithm recovers the correct support of x, and 
achieves an error bounded by 

$$\|\hat{x}_{OMP} - x\|_2^2 \le \frac{2\alpha}{(1 - (k-1)\mu)^2} \cdot k\sigma^2\log N. \tag{40}$$

Corollary 2 is nearly identical to [21, Thm. 4], with the only 
difference being that the constant $0.8/\sqrt{2} \approx 0.566$ in (39) is 
replaced in [21] with the slightly better constant $1/\sqrt{\pi} \approx 0.564$. 
This slight discrepancy can be resolved if the more 
accurate version (88a) of Lemma 4 is used in the proof of 
Theorem 5, but the resulting expression becomes much more 
cumbersome in the block sparse case. 

• Block sparsity vs. scalar sparsity: A legitimate question 
is whether the incorporation of the block sparsity structure 
substantially assists estimation algorithms. In other words, do 
the performance guarantees of the block algorithms BOMP 
and BTH compare favorably with the results achievable on 
identical signals using scalar sparsity algorithms, such as OMP 
and thresholding? This question is examined numerically in 
the next section. 

VII. Numerical Experiments 

From a practical point of view, it is important to determine 
whether the use of block sparse algorithms contributes sig- 
nificantly to the performance of estimation algorithms. After 
all, any block sparse signal containing k nonzero blocks of 
size d can also be viewed as a sparse signal containing kd 
nonzero elements. Is there a significant benefit in using the 
block algorithms rather than the ordinary scalar versions? 

There are two possible approaches to answering this ques- 
tion. First, one may compare the performance achieved in 
practice by block sparse and scalar sparse algorithms. This 
requires a complete specification of the problem setting, in- 
cluding a choice of the parameter value x, which is unknown 
in practice. Alternatively, one can compare the performance 
guarantees for block sparse techniques, which were derived 
in Section VI, to the previously known guarantees for scalar 
approaches [28]. The performance guarantees apply to all 
parameter values having a specified sparsity level, and are 
therefore more general. However, there may be a gap between
the guarantee and the performance observed in practice. In
order to take advantage of both approaches, in the following
we compare both the actual performance and the guarantees
of the various algorithms discussed in this paper.

TABLE I
Performance Guarantees for OMP and Block-OMP

[Figure 1: four panels, (a) Block-OMP, (b) Block-Thresholding, (c) OMP, (d) Thresholding; horizontal axis: noise variance.]

Fig. 1. Median squared error as a function of the noise variance for block and scalar sparse estimation algorithms. The shaded region indicates the range
of errors encountered for different parameter values. The dotted line plots the CRB. The thick solid line in Figs. 1(a) and 1(b) indicates the performance
guarantees for the block sparse algorithms; no guarantee can be made for the scalar sparsity techniques in Figs. 1(c) and 1(d).

In our experiments, we used dictionaries containing orthonormal
blocks. Such dictionaries were constructed by first
generating a random $L \times N$ matrix containing IID, zero-mean
Gaussian random variables, and then performing a Gram-Schmidt
procedure separately on the columns of each block.
As a first experiment, we generated a variety of such dictionaries,
and computed their coherence $\mu$ and block coherence $\mu_B$.
(The sub-coherence of dictionaries generated in this manner
is necessarily $\nu = 0$.) These values were used to compute
performance guarantees for BOMP (using Theorem 5) and
for OMP (using Corollary 2). We assumed throughout that
the minimum norm $|x_{\min}|$ among nonzero blocks equals 1
and that the minimum nonzero element equals $1/\sqrt{d}$. Some
typical results are listed in Table I. To compute the guarantees
in this table, the smallest value of $\alpha$ yielding a 99% probability
of success was chosen. The resulting guarantee is listed in
multiples of $\sigma^2$. For example, a value of Guarantee/$\sigma^2$ = 100
means that $\|x - \hat{x}\|_2^2 \le 100\sigma^2$ for 99% of the noise realizations.
Also listed in Table I are the maximum noise standard
deviations $\sigma_{\max}$ for which the performance guarantees still
hold. A dash (-) indicates that no guarantee can be made for
the given setting even in the noise-free case.
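The dictionary construction described above can be sketched as follows. This is only an illustration under stated assumptions: a QR factorization is used in place of an explicit Gram-Schmidt procedure (both yield an orthonormal basis for each block's column span), the dictionary is taken to be real-valued, and the three coherence measures are computed as the largest inner product between distinct atoms ($\mu$), the largest spectral norm of a cross-block Gram matrix divided by $d$ ($\mu_B$), and the largest inner product between distinct atoms within one block ($\nu$). The function names are ours.

```python
import numpy as np

def make_block_dictionary(L, M, d, rng):
    """Random L x (M*d) dictionary whose d-column blocks are each orthonormal."""
    D = rng.standard_normal((L, M * d))
    for b in range(M):
        cols = slice(b * d, (b + 1) * d)
        # QR stands in for Gram-Schmidt: Q is an orthonormal basis of the block.
        D[:, cols] = np.linalg.qr(D[:, cols])[0]
    return D

def coherence_measures(D, d):
    """Return (mu, mu_B, nu) for a real dictionary with unit-norm columns."""
    N = D.shape[1]
    M = N // d
    G = D.T @ D                                   # Gram matrix
    block_id = np.arange(N) // d
    same_block = block_id[:, None] == block_id[None, :]
    off_diag = ~np.eye(N, dtype=bool)
    mu = np.abs(G[off_diag]).max()
    nu = np.abs(G[same_block & off_diag]).max() if d > 1 else 0.0
    mu_B = 0.0
    for i in range(M):
        for j in range(i + 1, M):
            cross = G[i * d:(i + 1) * d, j * d:(j + 1) * d]
            mu_B = max(mu_B, np.linalg.norm(cross, 2) / d)  # spectral norm / d
    return mu, mu_B, nu

rng = np.random.default_rng(0)
D = make_block_dictionary(L=60, M=8, d=3, rng=rng)   # small illustrative sizes
mu, mu_B, nu = coherence_measures(D, d=3)
```

With orthonormal blocks, `nu` is zero up to rounding error, in agreement with the remark on sub-coherence above.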

It is evident from Table I that the block sparse algorithm 
BOMP is guaranteed to perform over a much wider range 
of problem settings than the scalar OMP approach. Further- 
more, even when performance guarantees are provided for 
both techniques, those for BOMP are substantially stronger. 
To provide merely one striking example from Table I, note 
that 50 measurements suffice for BOMP to identify a signal 
composed of a single 5-element block among a set of 1200 
possible blocks, whereas for OMP to identify such a signal 
at the same noise level, as many as 3000 measurements are 
required. The reason for this advantage is clear: the OMP 
algorithm must separately identify each nonzero component 
of the signal, and must therefore choose among a total of
$\binom{1200}{5} \approx 2.1 \cdot 10^{13}$ possible support sets. This is obviously
more challenging than identifying one nonzero block among 
a set of 1200 possibilities. Clearly, then, knowledge of a block- 
sparse structure can substantially improve performance if it is 
correctly utilized. 
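The size of the count quoted above can be checked directly. The short sketch below assumes that the figure refers to the number of 5-element subsets of 1200 items, and compares it with the 1200 candidates faced by a block algorithm searching for a single block; the variable names are ours.

```python
import math

# Number of 5-element subsets of 1200 items (assumed reading of the count above).
scalar_candidates = math.comb(1200, 5)
# A block algorithm looking for one nonzero block has only 1200 candidates.
block_candidates = 1200
ratio = scalar_candidates / block_candidates
```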

Table I also compares the performance guarantees with 
the CRB of Theorem 3. The CRB is listed for a random 
choice of support set S containing precisely k nonzero blocks; 
however, choosing different sets S only has a small effect on 
the value of the bound. The gap between these lower and 
upper bounds is not inconsiderable, and is typically on the 
order of a factor of 10. There are several reasons for this 
gap. First, the performance guarantees plotted above indicate 
an error which is obtained with 99% confidence, whereas the 
CRB is a bound on the MSE. By its very nature, the MSE 
averages out unusually disruptive noise realizations, and thus 
tends to be more optimistic. Second, different values of x 
may yield significantly different performance; the performance 
guarantees apply to all values of $x$, whereas the CRB is
plotted for a single, typical parameter value. Third, some loss
of tightness undoubtedly results from the derivations of the 
theorems, i.e., there may still be room for improved bounds. 

To measure the relative influence of these factors, we 
performed another experiment, in which the guarantees were 
compared with the actual performance of the various algo- 
rithms. To overcome the aforementioned pessimistic effect of 
a guarantee which holds with overwhelming probability, in 
this second experiment we computed guarantees with a 50% 
confidence level. In other words, these are assurances on the 
median of the distance between x and its estimate, which 
captures the typical estimation error. We also computed the 
actual median error of the various algorithms for a variety of 
parameter values. 

The details of this experiment are as follows. We constructed
a $3000 \times 6000$ dictionary $D$ containing $M = 1200$ blocks
of $d = 5$ atoms each, using the orthogonalization algorithm
described above. The resulting coherence of $D$ was $\mu = 0.094$,
the block coherence was $\mu_B = 0.026$, and since each block
was orthonormal, the sub-coherence was $\nu = 0$. We then
constructed a variety of block sparse vectors $x$, each having
$s = 3$ nonzero blocks, with $|x_{\min}| = 2\sqrt{d}$ and $|x_{\max}| = 3\sqrt{d}$.
We chose the parameter vectors so as to cover as wide a
range of scenarios as possible, within the aforementioned
requirements. For example, some parameter vectors contained
a block with a single nonzero component whose value was
$|x_{\max}|$, while other vectors contained a block with each of
the $d$ elements receiving a value of $|x_{\max}|/\sqrt{d}$. Although
it is clearly not feasible to cover the full range of possible
parameter vectors, it is hoped that in this way some sense is
given of the variability in performance for different parameter
values. Indeed, as shown below, different parameters often
yield widely differing estimation errors.

For each choice of a parameter vector, 20 noise realizations 
were generated and the resulting measurement vector y was 
computed using (8). The BOMP, BTH, OMP, and thresholding 
algorithms were then applied to each of the measurement 
vectors. For every technique and each parameter vector, the 
median estimation error (among the noise realizations) was 
computed. The range of median estimation errors obtained for 
different choices of x is plotted as a shaded area in Fig. 1. 
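The procedure just described can be sketched in a few lines. The implementations of BTH and BOMP below are our own minimal reading of the algorithms (select blocks by the norm of their correlation with the measurements or residual, then solve least squares on the selected blocks); they assume a real-valued dictionary with contiguous blocks of size `d`, and all function names are hypothetical.

```python
import numpy as np

def _block_norms(D, v, d):
    """Euclidean norm of D^T[i] v for every block i."""
    return np.linalg.norm((D.T @ v).reshape(-1, d), axis=1)

def _ls_on_blocks(D, y, blocks, d):
    """Least-squares estimate supported on the given blocks."""
    cols = np.concatenate([np.arange(b * d, (b + 1) * d) for b in blocks])
    x_hat = np.zeros(D.shape[1])
    x_hat[cols] = np.linalg.lstsq(D[:, cols], y, rcond=None)[0]
    return x_hat

def bth(D, y, d, k):
    """Block thresholding: keep the k blocks most correlated with y."""
    blocks = np.argsort(_block_norms(D, y, d))[-k:]
    return _ls_on_blocks(D, y, blocks, d)

def bomp(D, y, d, k):
    """Block OMP: greedily add the block most correlated with the residual."""
    blocks, residual = [], y.copy()
    x_hat = np.zeros(D.shape[1])
    for _ in range(k):
        norms = _block_norms(D, residual, d)
        if blocks:
            norms[blocks] = -np.inf      # never choose the same block twice
        blocks.append(int(np.argmax(norms)))
        x_hat = _ls_on_blocks(D, y, blocks, d)
        residual = y - D @ x_hat
    return x_hat

def median_error(estimator, D, x, d, k, sigma, trials, rng):
    """Median squared error of an estimator over several noise realizations."""
    errors = []
    for _ in range(trials):
        y = D @ x + sigma * rng.standard_normal(D.shape[0])
        errors.append(np.sum((estimator(D, y, d, k) - x) ** 2))
    return float(np.median(errors))
```

A call such as `median_error(bomp, D, x, d=5, k=3, sigma=0.1, trials=20, rng=np.random.default_rng(0))` would then give one point of the kind summarized by the shaded area, for a single parameter vector and noise level.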

In the present setting, neither of the scalar sparsity algo- 
rithms was capable of providing a performance guarantee. For 
BOMP and BTH, performance guarantees were available, and 
these are plotted as a solid line in Fig. 1. These guarantees 
are valid only up to a certain maximal noise variance, at 
which point the solid line in Fig. 1 stops. The results are 
also compared with the CRB of Theorem 3. It should be 
emphasized that the CRB is a bound on the MSE, rather 
than the median error, although in practice the differences 
between these two quantities appear to be quite small. It is 
also worth recalling that the CRB is a bound on unbiased 
estimators, while all of the techniques discussed herein are 
biased; nevertheless, it is evident that the CRB still provides 
a rough measure of the optimal performance of the proposed 
algorithms. 

[Figure 2: panels (a) Block-OMP and (b) Block-Thresholding; horizontal axis: noise variance.]

Fig. 2. Median squared error as a function of the noise variance for block sparse estimation algorithms. The shaded region indicates the range of errors
encountered for different parameter values. The dotted line plots the CRB. The thick solid line in Fig. 2(a) indicates the performance guarantee for BOMP;
no guarantee can be made for BTH. The deteriorated performance of BTH is a result of the existence of low-magnitude blocks.

Several comments are in order concerning Fig. 1. First,
the performance of both block sparse algorithms exhibits a
transition: near-CRB performance for low noise levels
deteriorates substantially when the noise level crosses a certain
threshold. This behavior qualitatively matches the predictions 
of the performance guarantees, which ensure support recovery 
and near-CRB performance for sufficiently low noise levels. 
The threshold at which this transition occurs is identified 
fairly accurately for BOMP, and less so for BTH, although 
it is possible that there exist some (untested) parameter values 
for which the BTH transition occurs at lower noise levels. 
However, the numeric value of the performance guarantee 
is somewhat pessimistic: while the observed performance is 
close to the CRB for all parameter values, analytically one 
can guarantee only that the median error will not be larger 
than approximately 10 times the CRB. This result is most 
likely due to the various inequalities employed in the proofs 
of Theorems 4 and 5. Indeed, since the correct support is 
identified with high probability for most noise realizations, 
the BTH and BOMP algorithms will likely tend to coincide 
with the oracle estimator, whose error equals that of the CRB. 
The question of formally proving such a claim remains a topic 
for further research. 

The advantages of the block sparse approach become ev- 
ident when compared with scalar sparsity algorithms (Figs. 
1(c) and 1(d)). For the scalar techniques, no performance 
guarantees can be made in the present setting. Unlike the block 
sparsity algorithms, the scalar approaches fail to recover the 
correct parameter vector even when the noise is negligible, 
and for some parameter values, their error does not converge 
to the CRB. The thresholding algorithm, in particular, ceases to 
improve (for some parameter values) as the noise is reduced, 
while the OMP approach, although significantly better than 
thresholding, does not converge to the CRB as do the block 
sparse techniques. This demonstrates the advantages of utiliz- 
ing the fact that the signal is known to have a block-sparse 
structure. 

The performance of BOMP (Fig. 1(a)) is quite similar to
that of BTH (Fig. 1(b)) in the experiment above. This is not
surprising when one compares our problem setting with the
guarantees of Section VI. Indeed, as we have seen, the primary
difference between the BOMP and BTH algorithms is that the
one-shot support estimation employed by BTH causes
large-magnitude blocks to overshadow small-magnitude nonzero
blocks. In the setting of Fig. 1, the range of magnitudes
between $|x_{\max}| = 3\sqrt{d}$ and $|x_{\min}| = 2\sqrt{d}$ is not very large,
and therefore BTH performs nearly as well as BOMP. The
advantages of BOMP become readily apparent if one considers
a wider dynamic range. This is illustrated in Fig. 2, in which
the setup is identical to that of the previous experiment,
except that parameter vectors having $|x_{\min}| = 0.1\sqrt{d}$ and
$|x_{\max}| = \sqrt{d}$ were chosen, yielding a 10-fold dynamic range
in the block magnitudes. In this case, while the guarantee for
BOMP is hardly changed, the conditions for Theorem 4 no
longer hold, so that nothing can be ensured concerning the
BTH technique. Indeed, in Fig. 2 we see that BTH performs
poorly for some parameter values even when the noise level
is low, and its performance is no longer proportional to the
CRB.
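The effect of the dynamic range on the two sets of conditions can be made concrete with a small calculation. The sketch below assumes that the BTH condition has the form of (53), namely $(1-(d-1)\nu)|x_{\min}| > 2\tau + (2k-1)d\mu_B|x_{\max}|$, and that the BOMP condition has the form of (71), in which $|x_{\max}|$ is replaced by $|x_{\min}|$ on the right-hand side; it also takes $k = 3$ to match the three nonzero blocks of the experiment. The helper names are ours. Each function returns the largest admissible $\tau$, so a negative value means that the condition fails even without noise.

```python
import math

def max_tau_bth(d, k, mu_B, nu, x_min, x_max):
    """Largest tau allowed by a condition of the form (53)."""
    return ((1 - (d - 1) * nu) * x_min - (2 * k - 1) * d * mu_B * x_max) / 2

def max_tau_bomp(d, k, mu_B, nu, x_min):
    """Largest tau allowed by a condition of the form (71)."""
    return (1 - (d - 1) * nu - (2 * k - 1) * d * mu_B) * x_min / 2

d, k, mu_B, nu = 5, 3, 0.026, 0.0      # values reported for the experiment
root_d = math.sqrt(d)

# Setting of Fig. 1: |x_min| = 2 sqrt(d), |x_max| = 3 sqrt(d)
fig1_bth = max_tau_bth(d, k, mu_B, nu, 2 * root_d, 3 * root_d)
fig1_bomp = max_tau_bomp(d, k, mu_B, nu, 2 * root_d)

# Setting of Fig. 2: |x_min| = 0.1 sqrt(d), |x_max| = sqrt(d)
fig2_bth = max_tau_bth(d, k, mu_B, nu, 0.1 * root_d, root_d)
fig2_bomp = max_tau_bomp(d, k, mu_B, nu, 0.1 * root_d)
```

Under these assumptions the BTH margin is positive in the first setting and negative in the second, whereas the BOMP margin remains positive in both.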

VIII. Conclusion 

In this paper, we analyzed the performance of the greedy
block algorithms BOMP and BTH under the adversarial and
Gaussian noise models. In the adversarial setting $\|w\|_2 \le \varepsilon$,
we showed that the estimation error is bounded by a constant times
the noise bound $\varepsilon$, which shows that performance in this case
will not necessarily reduce the noise power. The situation is
much better in the presence of random noise, where we saw
that, under suitable conditions, greedy techniques obtain an
error on the order of $dk\sigma^2 \log N$ with high probability; this is
substantially lower than the input noise power $N\sigma^2$. Indeed,
the BTH and BOMP algorithms come close to the CRB and
the error of the oracle estimator.

There remain many open questions concerning the performance
of block sparse techniques under random noise. For example,
for scalar sparsity, performance guarantees for convex
relaxation techniques do not require assumptions on the SNR.
An important challenge is to determine whether similar
SNR-independent results can be demonstrated for block convex
relaxation techniques such as L-OPT. Furthermore, it is
well-known that scalar sparsity guarantees can be strengthened
if the restricted isometry constants of the dictionary $D$ are
known, as is the case, for example, when $D$ is chosen from
an appropriate random ensemble. Thus, it is also of interest to
provide guarantees for block techniques under random noise
based on an extension of the RIP to the block sparse setting.
One such extension has already been proposed in [11], and its
application to the Gaussian noise model may provide tighter
performance bounds for some algorithms.

Appendix A 
Proofs for Adversarial Noise 

We begin by providing several lemmas which will prove 
useful for the analysis under both the adversarial and the 
Gaussian noise models. 

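Lemma 1. Given a dictionary $D$ having block coherence $\mu_B$
and sub-coherence $\nu$, we have

$$ \|D^*[i]D[j]\| \le d\mu_B \quad \text{for all } i \ne j \tag{41} $$

and

$$ \|D[i]\|^2 = \|D^*[i]D[i]\| \le 1 + (d-1)\nu. \tag{42} $$

If $1 - (d-1)\nu > 0$, then

$$ \left\|\left(D^*[i]D[i]\right)^{-1}\right\| \le \frac{1}{1-(d-1)\nu}. \tag{43} $$

Suppose $1 - (d-1)\nu - (k-1)d\mu_B > 0$ and let $I$ be an index
set with $|I| \le k$. Then

$$ \left\|\left(D_I^* D_I\right)^{-1}\right\| \le \frac{1}{1-(d-1)\nu-(k-1)d\mu_B}. \tag{44} $$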



Proof: The bound (41) follows directly from the definition
(12) of block coherence. To prove (42)-(43), observe that the
diagonal elements of the matrix $D^*[i]D[i]$ equal 1, while the
off-diagonal elements are bounded in magnitude by $\nu$. Therefore,
by the Gershgorin circle theorem [29], all eigenvalues of
$D^*[i]D[i]$ are in the range $[1-(d-1)\nu,\ 1+(d-1)\nu]$,
demonstrating (42). Furthermore, it follows that the eigenvalues of
$(D^*[i]D[i])^{-1}$ are in the range
$[(1+(d-1)\nu)^{-1},\ (1-(d-1)\nu)^{-1}]$, leading to (43).

It remains to prove (44). To this end, let $|I| = \ell \le k$ and
write $D_I^* D_I$ as

$$ D_I^* D_I = \begin{pmatrix} M[1,1] & M[1,2] & \cdots & M[1,\ell] \\ M[2,1] & M[2,2] & \cdots & M[2,\ell] \\ \vdots & \vdots & \ddots & \vdots \\ M[\ell,1] & M[\ell,2] & \cdots & M[\ell,\ell] \end{pmatrix} \tag{45} $$

where each $M[i,j]$ is a $d \times d$ matrix containing the correlations
between two blocks of dictionary atoms. From the definition
of block coherence, we have

$$ \|M[i,j]\| \le d\mu_B \quad \text{for all } i \ne j. \tag{46} $$

By a generalization of the Gershgorin circle theorem [30,
Thm. 2], it follows that all eigenvalues $\lambda$ of $D_I^* D_I$ satisfy

$$ \|M[i,i] - \lambda I\| \le \sum_{j \ne i} \|M[i,j]\| \le (\ell-1)d\mu_B \le (k-1)d\mu_B. \tag{47} $$

Now, from the definition of sub-coherence, the off-diagonal
elements of $M[i,i]$ are no larger in magnitude than $\nu$, while
the diagonal elements of $M[i,i]$ all equal 1. Therefore, by the
Gershgorin circle theorem, given an arbitrary constant $\lambda$, all
eigenvalues of the $d \times d$ matrix $M[i,i] - \lambda I$ are in the range
$[1-\lambda-(d-1)\nu,\ 1-\lambda+(d-1)\nu]$. Consequently

$$ \|M[i,i] - \lambda I\| \ge 1 - \lambda - (d-1)\nu. \tag{48} $$

Combining with (47) and rearranging, we conclude that all
eigenvalues of $D_I^* D_I$ satisfy

$$ \lambda \ge 1 - (d-1)\nu - (k-1)d\mu_B. \tag{49} $$

Consequently, the eigenvalues of $(D_I^* D_I)^{-1}$ are no larger
than $(1-(d-1)\nu-(k-1)d\mu_B)^{-1}$, establishing (44). ■
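The bound (44) can be checked numerically on a small random dictionary. The following sketch is an illustration only: the sizes are arbitrary, blocks are orthonormalized by QR so that the sub-coherence is zero up to rounding, and $d\mu_B$ is computed as the largest spectral norm of a cross-block Gram matrix, consistent with (41). All names are ours.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
L, M, d, k = 400, 30, 4, 2             # illustrative sizes, not from the paper

D = rng.standard_normal((L, M * d))
for b in range(M):
    cols = slice(b * d, (b + 1) * d)
    D[:, cols] = np.linalg.qr(D[:, cols])[0]   # orthonormal blocks

def block(i):
    return D[:, i * d:(i + 1) * d]

nu = 0.0                                # orthonormal blocks (up to rounding)
d_mu_B = max(np.linalg.norm(block(i).T @ block(j), 2)
             for i, j in itertools.combinations(range(M), 2))

# Right-hand side of (44); requires the denominator to be positive.
bound = 1.0 / (1.0 - (d - 1) * nu - (k - 1) * d_mu_B)

# Largest value of ||(D_I^T D_I)^{-1}|| over all index sets of k blocks.
worst = 0.0
for I in itertools.combinations(range(M), k):
    D_I = np.hstack([block(i) for i in I])
    worst = max(worst, np.linalg.norm(np.linalg.inv(D_I.T @ D_I), 2))
```

For $k = 2$ and orthonormal blocks the bound is attained by the most correlated pair of blocks, so `worst` and `bound` should agree up to rounding in this particular setup.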

Lemma 2. Consider the setting of Section II, and suppose it
is known that

$$ \max_{1 \le i \le M} \|D^*[i]w\|_2 \le \tau \tag{50} $$

for a given value $\tau > 0$. If the dictionary $D$ satisfies

$$ (1-(d-1)\nu)\,|x_{\max}| > 2\tau + (2s-1)d\mu_B\,|x_{\max}| \tag{51} $$

then

$$ \max_{j \in S} \|D^*[j]y\|_2 > \max_{j \notin S} \|D^*[j]y\|_2 \tag{52} $$

where $S = \mathrm{supp}(x)$.

If (51) is replaced by the stronger condition

$$ (1-(d-1)\nu)\,|x_{\min}| > 2\tau + (2s-1)d\mu_B\,|x_{\max}| \tag{53} $$

then

$$ \min_{j \in S} \|D^*[j]y\|_2 > \max_{j \notin S} \|D^*[j]y\|_2. \tag{54} $$

Proof: The proof is an extension of [21, Lemma 3] to the
block-sparse case, and is ultimately inspired by [5]. We first
note that

$$ \max_{j \notin S} \|D^*[j]y\|_2 = \max_{j \notin S} \Big\| D^*[j]w + \sum_{i \in S} D^*[j]D[i]x[i] \Big\|_2 \le \max_{j \notin S} \|D^*[j]w\|_2 + \max_{j \notin S} \sum_{i \in S} \|D^*[j]D[i]x[i]\|_2. \tag{55} $$

By (50), the first term in (55) is no larger than $\tau$. Together with
(41), we obtain

$$ \max_{j \notin S} \|D^*[j]y\|_2 \le \tau + s\,d\mu_B\,|x_{\max}| \le \tau + k\,d\mu_B\,|x_{\max}|. \tag{56} $$

On the other hand,

$$ \max_{j \in S} \|D^*[j]y\|_2 = \max_{j \in S} \Big\| D^*[j]w + \sum_{i \in S} D^*[j]D[i]x[i] \Big\|_2 \ge \max_{j \in S} \|D^*[j]D[j]x[j]\|_2 - \max_{j \in S} \Big\| D^*[j]w + \sum_{i \in S \setminus \{j\}} D^*[j]D[i]x[i] \Big\|_2. \tag{57} $$

As we have seen in the proof of Lemma 1, the eigenvalues of
$D^*[j]D[j]$ are bounded in the range $[1-(d-1)\nu,\ 1+(d-1)\nu]$.
Consequently

$$ \max_{j \in S} \|D^*[j]D[j]x[j]\|_2 \ge \max_{j \in S}\, (1-(d-1)\nu)\,\|x[j]\|_2 = (1-(d-1)\nu)\,|x_{\max}|. \tag{58} $$

Combining this result with (57), we have

$$ \max_{j \in S} \|D^*[j]y\|_2 \ge (1-(d-1)\nu)\,|x_{\max}| - \max_{j \in S} \sum_{i \in S \setminus \{j\}} \|D^*[j]D[i]x[i]\|_2 - \max_{j \in S} \|D^*[j]w\|_2. \tag{59} $$

Together with (50) and (41), this implies that

$$ \begin{aligned} \max_{j \in S} \|D^*[j]y\|_2 &\ge (1-(d-1)\nu)\,|x_{\max}| - (k-1)\,|x_{\max}|\,d\mu_B - \tau \\ &= (1-(d-1)\nu)\,|x_{\max}| - (2k-1)\,|x_{\max}|\,d\mu_B - 2\tau + k\,|x_{\max}|\,d\mu_B + \tau. \end{aligned} \tag{60} $$

Merging the results (56) and (60) yields

$$ \max_{j \in S} \|D^*[j]y\|_2 \ge \max_{j \notin S} \|D^*[j]y\|_2 + (1-(d-1)\nu)\,|x_{\max}| - (2k-1)\,|x_{\max}|\,d\mu_B - 2\tau. \tag{61} $$

Consequently, if (51) holds, then (52) follows, as required.
In a similar fashion, observe that

$$ \min_{j \in S} \|D^*[j]y\|_2 = \min_{j \in S} \Big\| \sum_{i \in S} D^*[j]D[i]x[i] + D^*[j]w \Big\|_2 \ge \min_{j \in S} \|D^*[j]D[j]x[j]\|_2 - \max_{j \in S} \sum_{i \in S \setminus \{j\}} \|D^*[j]D[i]x[i]\|_2 - \max_{j \in S} \|D^*[j]w\|_2. \tag{62} $$

As noted previously, all eigenvalues of $D^*[j]D[j]$ are larger
than or equal to $1-(d-1)\nu$, and therefore

$$ \min_{j \in S} \|D^*[j]D[j]x[j]\|_2 \ge (1-(d-1)\nu)\,|x_{\min}|. \tag{63} $$

Furthermore, using (41) we have, for $i \ne j$,

$$ \|D^*[j]D[i]x[i]\|_2 \le \|D^*[j]D[i]\|\,|x_{\max}| \le d\mu_B\,|x_{\max}|. \tag{64} $$

Substituting (50), (63), and (64) into (62) provides us with

$$ \begin{aligned} \min_{j \in S} \|D^*[j]y\|_2 &\ge (1-(d-1)\nu)\,|x_{\min}| - (k-1)\,d\mu_B\,|x_{\max}| - \tau \\ &= (1-(d-1)\nu)\,|x_{\min}| - (2k-1)\,d\mu_B\,|x_{\max}| - 2\tau + k\,d\mu_B\,|x_{\max}| + \tau. \end{aligned} \tag{65} $$

Finally, using (56) we obtain

$$ \min_{j \in S} \|D^*[j]y\|_2 \ge \max_{j \notin S} \|D^*[j]y\|_2 + (1-(d-1)\nu)\,|x_{\min}| - (2k-1)\,d\mu_B\,|x_{\max}| - 2\tau. \tag{66} $$

Therefore, if the condition (53) is satisfied, then (54) holds,
completing the proof. ■
We are now ready to prove Theorems 1 and 2. 

Proof of Theorem 1: Using (10) and (42), we have for
all $j$

$$ \|D^*[j]w\|_2 \le \|D[j]\|\,\|w\|_2 \le \varepsilon\sqrt{1+(d-1)\nu}. \tag{67} $$

Thus, (50) holds with $\tau = \varepsilon\sqrt{1+(d-1)\nu}$.

In light of (21), the condition (53) for the second part of
Lemma 2 holds, and therefore, by Lemma 2, we conclude
that (54) holds. It follows that all blocks $D[i]$ with $i \in S$ are
more highly correlated with $y$ than the off-support blocks $D[i]$, $i \notin S$.
Thus, the estimated support $\hat{S}$ contains the true support set $S$
(with the possible addition of superfluous indices if $s < k$).
It follows from the definition (16) of $\hat{x}_{\mathrm{BTH}}$ that
$(\hat{x}_{\mathrm{BTH}})_{\hat{S}} = D_{\hat{S}}^{\dagger} y$, and thus

$$ \begin{aligned} \|x - \hat{x}_{\mathrm{BTH}}\|_2^2 &= \|x_{\hat{S}} - (\hat{x}_{\mathrm{BTH}})_{\hat{S}}\|_2^2 \\ &= \|D_{\hat{S}}^{\dagger} D_{\hat{S}} x_{\hat{S}} - D_{\hat{S}}^{\dagger} y\|_2^2 \\ &\le \|D_{\hat{S}}^{\dagger}\|^2\,\|y - D_{\hat{S}} x_{\hat{S}}\|_2^2 \\ &= \|D_{\hat{S}}^{\dagger}\|^2\,\|w\|_2^2 \end{aligned} \tag{68} $$

where we have used the fact that $D_{\hat{S}}^{\dagger} D_{\hat{S}} = I$, which follows
from our assumption that $D_I$ has full column rank for any set $I$
of size $s$ (see Section II).

Since $|x_{\min}| \le |x_{\max}|$, it follows from (21) that

$$ 1 - (d-1)\nu > (2k-1)d\mu_B. \tag{69} $$

Therefore, we may apply (44), yielding

$$ \|D_{\hat{S}}^{\dagger}\|^2 = \left\|\left(D_{\hat{S}}^* D_{\hat{S}}\right)^{-1}\right\| \le \frac{1}{1-(d-1)\nu-(k-1)d\mu_B}. \tag{70} $$

Combining this result with (68) and using (10), we obtain (22),
as required. ■
Proof of Theorem 2: As shown in the proof of Theorem 1,
it follows from (10) that (50) holds with $\tau = \varepsilon\sqrt{1+(d-1)\nu}$.
From (23) we then have

$$ (1-(d-1)\nu)\,|x_{\min}| > 2\tau + (2k-1)d\mu_B\,|x_{\min}|. \tag{71} $$

Since $|x_{\max}| \ge |x_{\min}|$, this implies the condition (51) for the
first part of Lemma 2. Thus, by Lemma 2, the dictionary block
most highly correlated with $y$ is a block within the support $S$
of $x$. In other words, the first iteration in the BOMP algorithm
correctly identifies an element within the support $S$.

The proof continues by induction. Assume we have reached
the $\ell$th iteration with $2 \le \ell \le s$ and that all previous iterations
have correctly identified elements of $S$. In other words, using
the notation of Section III, we have $i_1, \ldots, i_{\ell-1} \in S$.

By definition, we now have

$$ r^{\ell-1} = y - D\hat{x}^{\ell-1} = D\tilde{x}^{\ell-1} + w \tag{72} $$

where $\tilde{x}^{\ell-1} = x - \hat{x}^{\ell-1}$ is the estimation error after $\ell-1$
iterations. Since $\mathrm{supp}(x) = S$ and, by induction, $\mathrm{supp}(\hat{x}^{\ell-1}) \subseteq S$,
we have $\mathrm{supp}(\tilde{x}^{\ell-1}) \subseteq S$. Furthermore, $\ell - 1 < s$, so that
$\mathrm{supp}(\hat{x}^{\ell-1})$ contains fewer than $s$ elements, and is thus a strict
subset of $S$. It follows that at least one nonzero block in $\tilde{x}^{\ell-1}$
is equal to the corresponding block in $x$. Therefore

$$ \max_j \|\tilde{x}^{\ell-1}[j]\|_2 \ge |x_{\min}|. \tag{73} $$

To summarize, by (72), $r^{\ell-1}$ can be thought of as a noisy
measurement of the block sparse vector $\tilde{x}^{\ell-1}$, which contains
a block whose norm is at least $|x_{\min}|$. Using (73) and (23), we
find that the condition (51) holds for this modified estimation
problem. Consequently, by Lemma 2, we have

$$ \max_{j \in S} \|D^*[j]r^{\ell-1}\|_2 > \max_{j \notin S} \|D^*[j]r^{\ell-1}\|_2. \tag{74} $$

Therefore, by (17), the $\ell$th iteration of the BOMP algorithm
will choose an index belonging to the correct support set
$S$, as long as $\ell \le s$.

Since the BOMP algorithm never chooses the same support
element twice, we conclude that precisely the $s$ elements of
$S$ will be identified in the first $s$ iterations. If $s < k$, then
the remaining iterations will identify some additional elements
not in $S$, so that ultimately the estimated support set
$\hat{S} = \{i_1, \ldots, i_k\}$ will satisfy $\hat{S} \supseteq S$. The estimate $\hat{x}_{\mathrm{BOMP}}$ therefore
satisfies $(\hat{x}_{\mathrm{BOMP}})_{\hat{S}} = D_{\hat{S}}^{\dagger} y$. Following the procedure (68)-(70)
in the proof of Theorem 1, we obtain in an identical
manner the required result (24). ■

Appendix B 
Proof of Theorem 3 

To compute the CRB, we must first determine the Fisher
information matrix $J(x)$ for estimating $x$ from $y$ of (8). This
can be done using a standard formula [23, p. 85] and yields

$$ J(x) = \frac{1}{\sigma^2} D^* D. \tag{75} $$

We now identify, for each $x \in \mathcal{X}$, an orthonormal basis
for the feasible direction subspace, which is defined as the
smallest subspace of $\mathbb{C}^N$ containing all feasible directions at
$x$. To this end, denote by $e_i$ the $i$th column of the $N \times N$
identity matrix. Consider first points $x \in \mathcal{X}$ for which $s < k$.
In other words, these are parameter values whose support $S$
contains fewer than $k$ elements. For such values of $x$, we have,
for any $\varepsilon$ and any $1 \le i \le N$,

$$ |\mathrm{supp}(x + \varepsilon e_i)| \le |S| + 1 \le k \tag{76} $$

and therefore $x + \varepsilon e_i \in \mathcal{X}$ for any $\varepsilon$ and for any $i$.
Consequently, the set of feasible directions at $x$ includes
$\{e_1, \ldots, e_N\}$, and the feasible direction subspace is therefore
$\mathbb{C}^N$ itself. Thus, for values $x$ containing fewer than $k$ nonzero
blocks, a convenient choice of a basis for the feasible direction
subspace consists of the columns of the identity matrix.

Next, consider maximal-support parameter values, i.e., vectors
$x$ for which $s = k$. It is now no longer possible to add
any vector $e_i$ to $x$ without violating the constraints. Indeed, it
is not difficult to see that the only feasible directions are linear
combinations of the unit vectors $e_i$ for which $i$ belongs to one
of the blocks in $S$. These unit vectors can thus be chosen as
a basis for the feasible direction subspace.

Let $U(x)$ be a matrix whose columns comprise the chosen
orthonormal basis for the feasible direction subspace at $x$.
Note that the dimensions of $U(x)$ change with $x$; specifically,
$U(x) = I_{N \times N}$ when $|S| < k$, and $U(x)$ is an $N \times sd$
matrix otherwise. A necessary condition for a finite-variance
$\mathcal{X}$-unbiased estimator to exist at a point $x$ is [27, Thm. 1]

$$ \mathcal{R}\big(U(x)U^*(x)\big) \subseteq \mathcal{R}\big(U(x)U^*(x)J(x)U(x)U^*(x)\big). \tag{77} $$

When $s < k$, we have $U(x) = I$. In this case, using (75), the
condition (77) becomes

$$ \mathbb{C}^N \subseteq \mathcal{R}(J(x)) = \mathcal{R}(D^*D). \tag{78} $$

Since the dimensions of $D$ are $L \times N$ with $L < N$, the rank
of $D^*D$ is at most $L$, and thus $\mathcal{R}(D^*D)$ cannot include the
entire space $\mathbb{C}^N$. We conclude that in this case, (77) does not
hold, and therefore no $\mathcal{X}$-unbiased estimator exists at points $x$
for which $|S| < k$, proving part (a) of the theorem.

Let us now turn to maximal-support parameter values $x$. As
we have seen above, in this case the matrix $U(x)$ consists of
the columns $e_i$ for which $i$ is an element of a block within
the support of $x$. Therefore, the product $DU(x)$ selects those
atoms of $D$ belonging to blocks within $S$, i.e., $DU(x) = D_S$.
Using (75), this leads to

$$ U^*(x)J(x)U(x) = \frac{1}{\sigma^2} D_S^* D_S \tag{79} $$

which is invertible by assumption (see Section II). It follows
that the condition (77) holds for maximal-support parameters
$x$. One can therefore apply [27, Thm. 1], which states that for
such values of $x$,

$$ \mathrm{MSE}(\hat{x}, x) \ge \mathrm{Tr}\Big(U(x)\big(U^*(x)J(x)U(x)\big)^{-1}U^*(x)\Big). \tag{80} $$

Combining with (79) and using the fact that $U^*(x)U(x) = I$,
we obtain (31), proving part (b) of the theorem.
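Combining (79) and (80) with $U^*(x)U(x) = I$ gives a bound of the form $\sigma^2\,\mathrm{Tr}\big((D_S^* D_S)^{-1}\big)$, which we assume here to be the expression referred to as (31). A minimal sketch for evaluating it, with a function name of our own choosing, is given below.

```python
import numpy as np

def block_sparse_crb(D_S, sigma2):
    """sigma^2 * Tr((D_S^* D_S)^{-1}) for the sub-dictionary D_S of support atoms.

    Assumes D_S has full column rank, so that the Gram matrix is invertible.
    """
    gram = D_S.conj().T @ D_S
    return float(sigma2 * np.trace(np.linalg.inv(gram)).real)

# Example: when the support atoms are orthonormal, the bound reduces to
# sigma^2 times the number of support atoms.
rng = np.random.default_rng(0)
D_S = np.linalg.qr(rng.standard_normal((10, 4)))[0]
crb = block_sparse_crb(D_S, sigma2=0.5)
```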

Appendix C 
Proofs for Gaussian Noise 

We begin with two lemmas which prove some useful 
properties of the Gaussian distribution. The first of these is 
a generalization of a result due to Sidak [31]. 

Lemma 3. Let $v_1, \ldots, v_M$ be a set of $M$ jointly Gaussian
random vectors. Suppose that $E\{v_i\} = 0$ for all $i$, but that the
covariances of the vectors are unspecified and that the vectors
are not necessarily independent. We then have

$$ \Pr\{\|v_1\|_2 \le c_1, \|v_2\|_2 \le c_2, \ldots, \|v_M\|_2 \le c_M\} \ge \Pr\{\|v_1\|_2 \le c_1\} \cdot \Pr\{\|v_2\|_2 \le c_2\} \cdots \Pr\{\|v_M\|_2 \le c_M\}. \tag{81} $$

Proof: We will demonstrate that

$$ \Pr\{\|v_1\|_2 \le c_1, \|v_2\|_2 \le c_2, \ldots, \|v_M\|_2 \le c_M\} \ge \Pr\{\|v_1\|_2 \le c_1\}\,\Pr\{\|v_2\|_2 \le c_2, \ldots, \|v_M\|_2 \le c_M\}. \tag{82} $$

The result then follows by induction. For simplicity of notation,
we will prove that (82) holds for the case $M = 2$; the
general result can be shown in the same manner.

Denote by $f(v_1|v_2)$ the pdf of $v_1$ conditioned on $v_2$. Observe
that, for a deterministic value $w$, the pdf $f(v_1|w)$ defines
a Gaussian random vector whose mean depends linearly on
$w$, but whose covariance is constant in $w$. Therefore, using a
result due to Anderson [32], it follows that

$$ \Pr\{\|v_1\|_2 \le c_1 \mid v_2 = aw\} = \int_{\|u\|_2 \le c_1} f(u \mid aw)\,du \tag{83} $$

is a non-increasing function of $a$.

Next, denoting by $f(v_2)$ the marginal pdf of $v_2$, we have

$$ \begin{aligned} \alpha(c_1,c_2) &= \Pr\{\|v_1\|_2 \le c_1 \mid \|v_2\|_2 \le c_2\} \\ &= \frac{\int_{\|u\|_2 \le c_1}\int_{\|w\|_2 \le c_2} f(u|w)f(w)\,dw\,du}{\Pr\{\|v_2\|_2 \le c_2\}} \\ &= \frac{\int_{\|w\|_2 \le c_2} \Pr\{\|v_1\|_2 \le c_1 \mid v_2 = w\}\,f(w)\,dw}{\int_{\|w\|_2 \le c_2} f(w)\,dw}. \end{aligned} \tag{84} $$

Thus, the function $\alpha(c_1,c_2)$ is a weighted average of
expressions of the form $\Pr\{\|v_1\|_2 \le c_1 \mid v_2 = w\}$ for values
of $w$ satisfying $\|w\|_2 \le c_2$. However, as we have shown,
$\Pr\{\|v_1\|_2 \le c_1 \mid v_2 = w\}$ is non-increasing in $\|w\|_2$.
Consequently, $\alpha(c_1,c_2)$ is non-increasing in $c_2$.

On the other hand, observe that as $c_2 \to \infty$, the probability
of the event $\|v_2\|_2 \le c_2$ converges to 1. Thus we have

$$ \lim_{c_2 \to \infty} \alpha(c_1,c_2) = \Pr\{\|v_1\|_2 \le c_1\}. \tag{85} $$

Combined with the fact that $\alpha(c_1,c_2)$ is non-increasing in $c_2$,
we find that

$$ \alpha(c_1,c_2) \ge \Pr\{\|v_1\|_2 \le c_1\} \quad \text{for all } c_1, c_2. \tag{86} $$

Using the definition of $\alpha(c_1,c_2)$ and applying Bayes's rule,
we obtain

$$ \Pr\{\|v_1\|_2 \le c_1, \|v_2\|_2 \le c_2\} \ge \Pr\{\|v_1\|_2 \le c_1\}\,\Pr\{\|v_2\|_2 \le c_2\} \tag{87} $$

and thus complete the proof. ■
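The inequality (87) can be illustrated by a quick Monte Carlo experiment with two strongly correlated Gaussian vectors. The dimensions, correlation level, and thresholds below are arbitrary choices of ours, made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, rho = 200_000, 3, 0.9

z1 = rng.standard_normal((n, dim))
z2 = rng.standard_normal((n, dim))
v1 = z1
v2 = rho * z1 + np.sqrt(1 - rho ** 2) * z2   # zero mean, correlated with v1

c1 = c2 = 1.5
in1 = np.linalg.norm(v1, axis=1) <= c1
in2 = np.linalg.norm(v2, axis=1) <= c2

joint = np.mean(in1 & in2)               # estimate of Pr{||v1|| <= c1, ||v2|| <= c2}
product = np.mean(in1) * np.mean(in2)    # estimate of the product of the marginals
```

In this setup the empirical joint probability is noticeably larger than the product of the marginals, as (87) predicts.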
Our next lemma bounds the tail probability of the chi- 
squared distribution. 



Lemma 4. Let $u$ be a $d$-dimensional Gaussian random vector
having mean zero and covariance $I$. Then, for any $t \ge 1$, we
have

$$ \Pr\{\|u\|_2^2 \ge t^2\} \le \frac{(d-2)!!\,\lceil d/2 \rceil}{2^{d/2-1}\,\Gamma(d/2)}\; t^{d-2} e^{-t^2/2} \tag{88a} $$

$$ \le 0.8\,d\,t^{d-2} e^{-t^2/2} \tag{88b} $$

where $\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\,dt$ is the Gamma function and

$$ n!! = \prod_{0 \le i < n/2} (n - 2i) \tag{89} $$

is the double factorial operator.

Of the two bounds provided in (88), the first is somewhat
tighter, but obviously more cumbersome. For analytical
tractability, we will use the latter bound in the sequel.

Proof of Lemma 4: The expression $\|u\|_2^2$ is distributed
as a chi-squared random variable with $d$ degrees of freedom.
Therefore, its tail probability is given by [33, §16.3]

$$ \Pr\{\|u\|_2^2 \ge t^2\} = \frac{\Gamma(d/2,\ t^2/2)}{\Gamma(d/2)} \tag{90} $$

where $\Gamma(a, z)$ is the incomplete Gamma function
$\Gamma(a, z) = \int_z^\infty t^{a-1} e^{-t}\,dt$. It follows from the series expansion of $\Gamma(a, z)$
that [34, §6.5.32]

$$ \Gamma\!\left(\frac{d}{2}, \frac{t^2}{2}\right) \le \frac{e^{-t^2/2}}{2^{d/2-1}\,t^2} \Big[ t^d + (d-2)t^{d-2} + (d-2)(d-4)t^{d-4} + \cdots + (d-2)!!\,t^m \Big] \tag{91} $$

where $m = 1$ when $d$ is odd and $m = 2$ when $d$ is even. Note
that (91) holds with equality for even $d$, but the inequality is
strict for odd $d$. Since $t \ge 1$, we can enlarge each of the terms
in the square brackets in (91) by replacing it with $(d-2)!!\,t^d$.
The total number of terms in brackets is $\lceil d/2 \rceil$, yielding

$$ \Gamma\!\left(\frac{d}{2}, \frac{t^2}{2}\right) \le \frac{e^{-t^2/2}\,t^{d-2}}{2^{d/2-1}} \left\lceil \frac{d}{2} \right\rceil (d-2)!!. \tag{92} $$

Substituting into (90) demonstrates (88a).

To prove (88b), we distinguish between even and odd values
of $d$. Assume first that $d$ is even and denote $d = 2p$. We then
have

$$ \Gamma(d/2) = \Gamma(p) = (p-1)! \tag{93} $$

and

$$ (d-2)!! = (2p-2)!! = 2^{p-1}(p-1)!. \tag{94} $$

Substituting these values into (88a) and simplifying yields

$$ \Pr\{\|u\|_2^2 \ge t^2\} \le \frac{d}{2}\, t^{d-2} e^{-t^2/2} \tag{95} $$

which clearly satisfies (88b).

Similarly, assume that $d$ is odd and write $d = 2p + 1$.
Substituting the formula

$$ \Gamma(d/2) = \Gamma(p + 1/2) = \frac{(2p-1)!!\,\sqrt{\pi}}{2^p} \tag{96} $$

into (88a), we obtain

$$ \Pr\{\|u\|_2^2 \ge t^2\} \le \frac{d+1}{\sqrt{2\pi}}\, t^{d-2} e^{-t^2/2}. \tag{97} $$

It is easily verified that

$$ \frac{d+1}{\sqrt{2\pi}} \le 0.8\,d \quad \text{for all } d \ge 1. \tag{98} $$

Substituting back into (97) yields the required result. ■
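The two bounds of Lemma 4 can be compared numerically with the exact chi-squared tail. The sketch below uses only the Python standard library and rests on our reading of the bounds as $\Pr\{\|u\|_2^2 \ge t^2\} \le (d-2)!!\,\lceil d/2\rceil\, t^{d-2} e^{-t^2/2} / (2^{d/2-1}\Gamma(d/2)) \le 0.8\,d\,t^{d-2}e^{-t^2/2}$. The exact tail is obtained from the standard recurrence $Q(a+1,z) = Q(a,z) + z^a e^{-z}/\Gamma(a+1)$ for the regularized incomplete Gamma function, started from $Q(1,z) = e^{-z}$ or $Q(1/2,z) = \mathrm{erfc}(\sqrt{z})$. All function names are ours.

```python
import math

def chi2_tail(d, t):
    """Pr{||u||_2^2 >= t^2} for u ~ N(0, I_d)."""
    z = t * t / 2
    if d % 2 == 0:
        a, q = 1.0, math.exp(-z)                # Q(1, z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))     # Q(1/2, z)
    while a < d / 2:
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1
    return q

def double_factorial(n):
    """n!! as the product of (n - 2i) over 0 <= i < n/2 (equal to 1 for n <= 0)."""
    out = 1
    while n > 1:
        out *= n
        n -= 2
    return out

def tighter_bound(d, t):
    """The more accurate of the two bounds."""
    return (double_factorial(d - 2) * math.ceil(d / 2) * t ** (d - 2)
            * math.exp(-t * t / 2) / (2 ** (d / 2 - 1) * math.gamma(d / 2)))

def simpler_bound(d, t):
    """The simpler bound used in the sequel."""
    return 0.8 * d * t ** (d - 2) * math.exp(-t * t / 2)

# Tabulate (d, t, exact tail, tighter bound, simpler bound) for a few cases.
table = [(d, t, chi2_tail(d, t), tighter_bound(d, t), simpler_bound(d, t))
         for d in (1, 2, 5) for t in (1.0, 2.0, 3.0)]
```

For $d = 1$ the two bounds nearly coincide, which is consistent with the small difference between the constants discussed after Corollary 2.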
Our next result applies more specifically to the block sparse
estimation setting. Following [4], [21], we consider the event

$$ B = \Big\{ \max_{1 \le i \le M} \|D^*[i]w\|_2^2 \le \tau^2 \Big\} \tag{99} $$

where

$$ \tau^2 = 2d\alpha\sigma^2\,(1 + (d-1)\nu)\,\log N \tag{100} $$

for a given $\alpha \ge 1/(2d\log N)$. We then have the following
lemma.

Lemma 5. Under the setting of Section II, assume that w is a 
Gaussian random vector with mean zero and covariance a 2 1. 
Then, the probability of the event B of (99) is bounded by 

Q.^adlogNY' 2 - 1 



Vi{B} > 1 



(101) 



]\[ad-l 

Proof: Observe that D*[i]w is a rf-dimensional Gaussian 
random vector with mean zero and covariance a 2 D*[i]D[i]. 
Therefore, the random vector 

u = —(D*[i]D[i])~ 1 l 2 D*[i]w (102) 
a 

is a rf-dimensional Gaussian random vector with mean zero 
and covariance I. We thus have 

Pr{||ir[zH| 2 < r 2 } = Pv{a 2 \\(D*[i\D[i\) 1/ Ml < r 2 } 
>Pr{a 2 \\D*[mm-H\l<r 2 } 

- Pr { HI '- . 2 (l + (I-l).) 

(103) 

where, in the last step, we used (42). Using Lemma 4 and substituting the value (100) of $\tau^2$, we obtain

$$\Pr\{\|D^*[i]\,w\|_2^2 \le \tau^2\} \ge 1 - \eta \tag{104}$$

where

$$\eta = 0.8\,d\,(2\alpha d \log N)^{d/2-1} \exp(-d\alpha \log N) = \frac{0.8\,d\,(2\alpha d \log N)^{d/2-1}}{N^{\alpha d}}. \tag{105}$$



Using Lemma 3, we have

$$\Pr\{B\} \ge \prod_{i=1}^{M} \Pr\{\|D^*[i]\,w\|_2^2 \le \tau^2\} \ge (1 - \eta)^M. \tag{106}$$

When $\eta > 1$, the bound (101) is meaningless and the lemma holds vacuously. Otherwise, when $\eta \le 1$, we have

$$\Pr\{B\} \ge 1 - M\eta \tag{107}$$

where we used the fact that $(1 - \eta)^M \ge 1 - M\eta$ whenever $\eta \le 1$ and $M \ge 1$. Substituting the value of $\eta$ from (105) and recalling that $N = Md$ yields the required result. ■
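To make the last substitution concrete, the sketch below evaluates the per-block quantity $\eta$ of (105) and the bound (101), and shows that $1 - M\eta$ and (101) coincide when $N = Md$. The function names and the numerical values of $N$, $d$ and $\alpha$ are hypothetical, chosen only for illustration.

```python
import math

def eta(N, d, alpha):
    """Per-block quantity (105): 0.8 d (2 alpha d log N)^(d/2-1) / N^(alpha d)."""
    base = 2.0 * alpha * d * math.log(N)
    return 0.8 * d * base ** (d / 2.0 - 1.0) / N ** (alpha * d)

def lemma5_bound(N, d, alpha):
    """Lower bound (101) on Pr{B}; a value <= 0 means the bound is vacuous."""
    base = 2.0 * alpha * d * math.log(N)
    return 1.0 - 0.8 * base ** (d / 2.0 - 1.0) / N ** (alpha * d - 1.0)

if __name__ == "__main__":
    N, d, alpha = 1000, 4, 1.0   # hypothetical values for illustration
    M = N // d                   # number of blocks, since N = M d
    e = eta(N, d, alpha)
    print("eta            :", e)
    print("(1 - eta) ** M :", (1.0 - e) ** M)
    print("1 - M * eta    :", 1.0 - M * e)
    print("bound (101)    :", lemma5_bound(N, d, alpha))
```

Larger values of $\alpha$ push the bound toward 1 at the cost of a larger threshold $\tau^2$ in (100).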
We are now ready to prove Theorems 4 and 5. 



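Proof of Theorem 4: By Lemma 5, the event $B$ of (99) occurs with probability exceeding (34). Furthermore, using (33), it follows from Lemma 2 that under the event $B$, all blocks in the correct support set $S$ are more highly correlated with $y$ than the off-support blocks. Consequently, when $B$ occurs, we have $S \subseteq \hat{S}$, where $\hat{S}$ is the support estimated by the BTH algorithm. Note, however, that the estimated set $\hat{S}$ will contain additional blocks not in $S$ if $s < k$. It follows that

$$\begin{aligned}
\|x - \hat{x}_{\mathrm{BTH}}\|_2^2 &= \|x_{\hat{S}} - (\hat{x}_{\mathrm{BTH}})_{\hat{S}}\|_2^2 \\
&= \|D_{\hat{S}}^{\dagger} D_{\hat{S}}\, x_{\hat{S}} - D_{\hat{S}}^{\dagger} y\|_2^2 \\
&= \|D_{\hat{S}}^{\dagger} w\|_2^2 \\
&\le \|(D_{\hat{S}}^{*} D_{\hat{S}})^{-1}\|^2\, \|D_{\hat{S}}^{*} w\|_2^2 \\
&= \|(D_{\hat{S}}^{*} D_{\hat{S}})^{-1}\|^2 \sum_{i \in \hat{S}} \|D^*[i]\,w\|_2^2
\end{aligned} \tag{108}$$

where we have used the fact that $D_{\hat{S}}^{\dagger} D_{\hat{S}} = I$, which is a consequence of the assumption that $D_{\hat{S}}$ has full column rank (see Section II). Using (44) and (99), we have that when $B$ occurs

$$\|x - \hat{x}_{\mathrm{BTH}}\|_2^2 \le \frac{k \tau^2}{\left(1 - (d-1)\nu - (k-1) d \mu_B\right)^2}. \tag{109}$$

Substituting the value (100) of $\tau^2$ yields the required result (35). ■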
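The resulting guarantee is straightforward to evaluate. The sketch below computes the right-hand side of (109) with $\tau^2$ taken from (100); it is an illustration only. The function names are hypothetical, `nu` and `mu_B` stand for the coherence quantities $\nu$ and $\mu_B$ appearing in those formulas, and the check on the sign of the denominator is a guard added for this sketch rather than a condition quoted from the theorem.

```python
import math

def tau_squared(N, d, alpha, sigma, nu):
    """Threshold (100): tau^2 = 2 d alpha sigma^2 (1 + (d-1) nu) log N."""
    return 2.0 * d * alpha * sigma ** 2 * (1.0 + (d - 1) * nu) * math.log(N)

def error_bound(N, d, k, alpha, sigma, nu, mu_B):
    """Right-hand side of (109): k tau^2 / (1 - (d-1) nu - (k-1) d mu_B)^2."""
    denom = 1.0 - (d - 1) * nu - (k - 1) * d * mu_B
    if denom <= 0.0:
        # Guard added in this sketch: the expression is not a useful bound here.
        raise ValueError("coherence too large for the bound to be meaningful")
    return k * tau_squared(N, d, alpha, sigma, nu) / denom ** 2

if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    print(error_bound(N=1000, d=4, k=3, alpha=1.0, sigma=0.1, nu=0.05, mu_B=0.01))
```

The bound scales linearly in the number of selected blocks $k$ and in the noise variance $\sigma^2$, and only logarithmically in the dictionary size $N$.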
Proof of Theorem 5: It follows from Lemma 5 that the 
event B occurs with probability exceeding (34). Our goal in 
this proof will thus be to show that, if B does occur, then 
the BOMP algorithm correctly identifies all elements of the 
support S of x (although some off-support elements may be 
identified as well if s < k). The remainder of the proof will 
then follow the steps of the proof of Theorem 4. 

To demonstrate that the correct support is recovered, we begin by analyzing the first iteration of the BOMP algorithm. This iteration chooses a block $i_1$ having maximal correlation $\|D^*[i_1]\,y\|_2$ with the measurements $y$. Now, since $|x_{\max}| \ge |x_{\min}|$, the condition (36) implies (51), with $\tau$ given by (100). Consequently, by Lemma 2, under the event $B$ we find that the first iteration of BOMP identifies an element $i_1$ in the correct support set $S$.
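The selection rule analyzed here, picking the block with the largest correlation norm, can be sketched in a few lines. This is an illustrative fragment rather than a full BOMP implementation: it assumes a real-valued dictionary stored as a list of columns, the function names are hypothetical, and the `chosen` argument is an explicit guard added for the sketch.

```python
import math

def block_correlations(D, r, d):
    """Return [||D*[i] r||_2 for each block i], for a real dictionary D.

    D is a list of columns (each a list of floats); block i consists of
    columns i*d, ..., i*d + d - 1. r is the vector to correlate against.
    """
    inner = [sum(a * b for a, b in zip(col, r)) for col in D]
    return [math.sqrt(sum(v * v for v in inner[i:i + d]))
            for i in range(0, len(inner), d)]

def select_block(D, r, d, chosen=()):
    """Index of the not-yet-chosen block with maximal correlation with r."""
    corr = block_correlations(D, r, d)
    candidates = [i for i in range(len(corr)) if i not in chosen]
    return max(candidates, key=lambda i: corr[i])

if __name__ == "__main__":
    # Hypothetical toy dictionary: 4 unit-norm columns in R^4, block size d = 2.
    D = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    y = [0.1, -0.2, 3.0, 1.0]  # dominated by the second block, plus small noise
    print(select_block(D, y, d=2))
```

In the first iteration `r` is the measurement vector $y$ itself; in later iterations the same rule would be applied to the current residual.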

To show that the next $s - 1$ iterations of the BOMP algorithm also identify support elements, we proceed by induction. Specifically, assume that $\ell - 1 < s$ iterations have correctly identified elements $i_1, \ldots, i_{\ell-1}$, all of which are in the support set $S$. As in the proof of Theorem 2, define the estimation error after $\ell - 1$ iterations as $\tilde{x}^{\ell-1} = x - \hat{x}^{\ell-1}$. By the induction hypothesis, $\mathrm{supp}(\hat{x}^{\ell-1}) \subset S$, and clearly $\mathrm{supp}(x) = S$. Thus $\mathrm{supp}(\tilde{x}^{\ell-1}) \subseteq S$, while the support of $\hat{x}^{\ell-1}$ is a strict subset of $S$. Using the same arguments as in the proof of Theorem 2, we find that $\tilde{x}^{\ell-1}$ contains a block whose norm is at least $|x_{\min}|$. Therefore, we can consider a modified estimation problem, in which $r^{\ell}$ is a noisy measurement vector of the block sparse signal $\tilde{x}^{\ell-1}$. Together with (36), this implies that (51) holds for the modified setting. Therefore, by (52), the block having maximal correlation with the modified measurements $r^{\ell}$ is an element of $S$. Consequently, BOMP will correctly identify a support element in the $\ell$th iteration. Since the BOMP algorithm never selects a previously chosen support element, we find by induction that the support set $S$ will be identified in full after $s$ iterations. If $s < k$, then the remaining $k - s$ iterations will identify arbitrary off-support elements.

Denoting by $\hat{S}$ the complete $k$-element support set identified by the BOMP approach, we thus have $S \subseteq \hat{S}$. Following the technique (108)-(109) used in the proof of Theorem 4 thus yields the required result (37). ■